Effect size: Researchers have developed a metric called effect size for comparing results across studies; it expresses the magnitude of the relationship between an intervention (e.g., the use of a software package) and an outcome (e.g., student test scores). Effect sizes are calculated for studies in which students from an intervention group and from a matched comparison group have been tested using the same measure of performance. The first step in calculating an effect size is to compute the difference between the average score of students in the treatment (intervention) group and the average score of students in the comparison group. The sign of this difference determines whether the effect size is positive or negative: a positive effect size means the students who received the intervention outscored the students in the comparison group. The second step is to divide this difference by some measure of the variation in scores within the groups; usually, the standard deviation of the comparison group is used as the denominator. The effect size, then, is the number of standard deviation units by which the treatment group outperforms the comparison group. Be aware that the design of a study may affect the reported effect size: poorly designed studies may inflate the estimate.
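The two-step calculation described above can be sketched in a few lines of Python (a minimal illustration; the scores below are invented):

```python
from statistics import mean, stdev

def effect_size(treatment_scores, comparison_scores):
    """Effect size: difference in group means divided by the
    standard deviation of the comparison group."""
    diff = mean(treatment_scores) - mean(comparison_scores)
    return diff / stdev(comparison_scores)

# Invented scores from the same test for two matched groups
treatment = [78, 85, 80, 90, 82, 88]
comparison = [75, 80, 70, 82, 78, 77]

es = effect_size(treatment, comparison)
print(f"Effect size: {es:+.2f}")
```

A positive result means the treatment group outscored the comparison group, measured in standard deviation units of the comparison group.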
To understand the practical meaning of an effect size, it is important
to know how an effect size compares with more familiar metrics or with
results from other interventions. For example, an effect size of +1.0
is equivalent to about 15 points on an IQ test or 21 NCEs on the Stanford
Achievement Test. Sometimes, effect sizes are reported in terms of "months
of learning gain" or "grade equivalents." An effect size
of 0.1 is equivalent to approximately 1 month of learning gain. Effect sizes of educational
interventions are rarely as large as -1.0 or +1.0. For example, many educators
believe reducing class size is an effective way to improve student learning,
but effect sizes for studies of class size reduction are between +0.13
and +0.18. The practical meaning of the magnitude of an effect should
always be interpreted in context (Glass et al., 1981). The effectiveness
of a particular intervention should only be assessed in relationship to
other interventions trying to bring about the same effect (e.g., an improvement
in reading achievement or attitudes toward technology use). In addition,
the relative cost to produce the effect must be considered. Effect sizes
as small as +0.10 may be of important practical significance if (1) the
intervention that produced the improvement is relatively inexpensive compared
to other competing options; (2) the effect is achieved across all groups
of students; and (3) the effect accumulates over time.
Differential sample attrition: Sample attrition occurs when members of the sample drop out during the course of a study. This becomes a problem if students with a particular characteristic (e.g., low income) drop out more frequently from one group (treatment or comparison) than from the other; this is known as differential sample attrition. As a result, the groups are no longer similar on an important characteristic that other research has found to be related to how well students perform on tests, which makes it difficult to interpret the reason for differences in test scores between the treatment and comparison groups. The fact that the groups are composed of different types of students by the end of the study may explain why one group ends up with higher test scores than the other; without accounting for attrition, you might incorrectly attribute the difference to the intervention.
Grade equivalent score: The grade
equivalent score is one of the most widely used, and confusing, metrics
to report student performance. It describes test performance in terms
of a grade level and the months since the beginning of the school year.
For instance, a GE of 4.5 (sometimes reported as 45) is the fifth month
of the fourth grade year. If a student in the third grade scores at the
seventh grade level on a test, it does not mean that the third grade student
is capable of doing seventh grade work. It means that the third grader
did as well as a seventh grader taking the third grade test.
Grade equivalent scores for individual students should never
be averaged to create a school average. You should not use grade equivalent
scores to compare student performance in different subject areas. Only
NCE scores should be used for these purposes.
Inferential statistics: One use of statistics is to be able to make inferences or judgments about a larger population based on the data collected from a small sample drawn from the population. Exit polling used during elections to determine how the population of voters voted is an example of the use of inferential statistics. A key component of inferential statistics is the calculation of statistical significance of a research finding.
Interrupted time series design: One limitation of the pretest/posttest design is that it does not take into account that students were already on a particular learning trajectory before the treatment. One way researchers may try to improve on this design is to determine pre-intervention trends in performance. Researchers may collect information on student or school performance for several semesters or years prior to the introduction of the intervention. These pre-intervention trends are then compared to the trends in outcomes following the introduction of the technology program. Any differences observed between the trends may be associated with the effectiveness of the intervention.
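The trend comparison can be sketched as follows (a simplified illustration with invented yearly scores; a real analysis would use more pre-intervention data points and formal statistical tests):

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for the pre-intervention trend."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented average yearly test scores; the intervention begins in year 4
pre_years, pre_scores = [1, 2, 3], [70.0, 71.0, 72.0]
post_years, post_scores = [4, 5], [76.0, 78.0]

slope, intercept = linear_trend(pre_years, pre_scores)
for year, observed in zip(post_years, post_scores):
    projected = slope * year + intercept   # where the pre-trend predicts scores would be
    print(f"Year {year}: projected {projected:.1f}, observed {observed:.1f}, "
          f"difference {observed - projected:+.1f}")
```

Post-intervention scores running consistently above the projected pre-intervention trend are the kind of pattern this design looks for.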
Norm-referenced scores: (from http://www.uiowa.edu/~itp/bs-interpret.htm) A norm-referenced score involves comparing a student's score with the scores other students obtained on the same test. How much a student knows is determined by the student's standing or rank within the reference group. High standing is interpreted to mean the student knows a lot or is highly skilled, and low standing means the opposite. Obviously, the overall competence of the norm group affects the interpretation significantly. Ranking high in an unskilled group may represent lower absolute achievement than ranking low in an exceptional, high-performing group.
Normal Curve Equivalent scores (NCE): A
normal curve equivalent score is a type of norm-referenced
score. It differs from a percentile rank
score in that it allows meaningful comparison between different test sections
within a test. For example, if a student receives NCE scores of 53 on
the Reading test and 45 on the Mathematics test, you can correctly say
that the Reading score is eight points higher than the Mathematics score.
NCEs are represented on a scale of 1 - 99. This scale coincides with a percentile rank scale at 1, 50, and 99. Unlike percentile rank scores, the interval between scores is equal. This means that you can average NCE scores to compare groups of students or schools.
Percentile rank: A percentile rank (PR) score is a type of norm referenced score. A PR score indicates the percentage of pupils in the reference or norm group whose scores for a test fell below a particular pupil's raw score. The reference group is usually selected by the publisher of the test to represent the average school in the district, state, or country. A student's PR score will change for different reference groups.
A percentile rank score of 45 means that the student scored better than 45% of the students in the reference group. The intervals between percentile rank scores are not equal: they are much closer together in the middle of the distribution (between the 40th and 60th percentiles) than at the high and low ends. Thus it is easier for students who score in the middle of the range to improve in percentile rank than it is for students at the bottom or the top. For this reason, individual percentile rank scores should never be averaged to compute an average percentile rank for a school, nor should they be used to measure change in student performance. In addition, it is inappropriate to use percentile rank scores to compare student performance in different subject areas. Only normal curve equivalents should be used for these purposes.
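The unequal spacing of percentile ranks can be seen by converting them to NCEs, which are equal-interval by construction: an NCE is defined as 50 plus 21.06 times the normal deviate (z-score) for the percentile. A short sketch:

```python
from statistics import NormalDist

def pr_to_nce(pr):
    """Convert a percentile rank (1-99) to a normal curve equivalent.
    NCEs have mean 50 and standard deviation 21.06 by definition."""
    z = NormalDist().inv_cdf(pr / 100)
    return 50 + 21.06 * z

# The two scales coincide at 1, 50, and 99
for pr in (1, 25, 50, 75, 99):
    print(f"PR {pr:2d} -> NCE {pr_to_nce(pr):5.1f}")

# The same 10-point PR gain is a smaller NCE gain in the middle
# of the distribution than near the tail:
print(pr_to_nce(60) - pr_to_nce(50))   # gain in the middle
print(pr_to_nce(95) - pr_to_nce(85))   # gain near the top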
Population: The group with a particular set of characteristics to which researchers attempt to generalize their findings from a smaller sample. These are the objects of generalizations for inferential statistics.
Probability samples: The ideal way to select a representative sample is to set up a selection procedure whereby each student in the population has a known, nonzero chance of being selected for the study. Such samples are called probability samples.
Raw score (RS): A raw score is the
number of items answered correctly for a test. These scores are used to
derive the other norm-related scores such as percentiles, standard scores,
and normal curve equivalents. A raw score by itself has little meaning.
Raw scores cannot be used to compare student performance across different
subject areas or tests.
Sample: The sample is the group of people you select to be in your study. The sample should be representative of a larger population to whom you want to generalize.
Sample of convenience: Many researchers
select students, classrooms, or schools for their samples because they
are readily available. This type of sample is called a sample of convenience.
You should be aware that students or schools who volunteer for a study
or who are more readily available to the researchers may have certain
levels of characteristics -- such as ability, motivation, or attitudes
toward technology -- that make them a unique group. For this reason, use
caution when trying to generalize research findings based on a sample
of convenience to a larger population.
Scale scores: Scale scores are calculated
by applying sophisticated computational procedures directly to the pattern
of student responses to the items or questions on a test. Many tests use
this as the basic measure of the student's performance. Scale scores can
have different scale ranges such as college admissions tests like the
ACT (range of 0 to 36) and the SAT (range of 0 to 1600). Scale scores
are used primarily to provide a basis for deriving various other normative
scores to describe performance (like percentile ranks). Unlike percentile
rank scores, the interval between scores is equal. This means that you
can average scale scores to compare groups of students or schools.
Sampling error: The difference in
results for different samples of the same size is called sampling error.
All things being equal, by increasing the sample size, from say 25 to
125 students, the sampling error will be reduced (but not eliminated)
and the study findings can be assumed to be more reliable.
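The effect of sample size on sampling error can be illustrated by simulation (a sketch; the population of scores is invented). Drawing many samples of each size and measuring the spread of their means shows directly how much sample results vary:

```python
import random
from statistics import mean, stdev

random.seed(0)
# Invented population of 10,000 test scores (mean about 70, SD about 10)
population = [random.gauss(70, 10) for _ in range(10_000)]

def spread_of_sample_means(n, trials=1000):
    """Draw many samples of size n; the spread (standard deviation)
    of their means is a direct picture of sampling error."""
    means = [mean(random.sample(population, n)) for _ in range(trials)]
    return stdev(means)

print(f"n = 25:  spread of sample means {spread_of_sample_means(25):.2f}")
print(f"n = 125: spread of sample means {spread_of_sample_means(125):.2f}")
```

The larger samples produce means that cluster much more tightly around the population mean, which is what "reduced sampling error" means in practice.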
Statistical power: Statistical power
is the probability you will detect a meaningful difference, or effect,
if one were to occur. Ideally, studies should have power levels of 0.80
or higher -- an 80% chance or greater of finding an effect if one were
really there. The "power" of any individual study depends on
(1) the number of individuals in the study (sample size), (2) the
expected size of the improvement in student performance from using the
software (effect size); and (3) the precision of the measures used
to assess student performance. All else being equal, statistical power
increases with increasing sample size, larger expected effects, and more
precise measures. As a rule of thumb, researchers concerned about
the "power" of their studies prefer to use sample sizes of at
least 30 individuals in both the software-use and non-use groups;
in general, larger samples improve accuracy.
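The interplay of sample size and effect size can be sketched with a standard normal approximation to the power of a two-group comparison of means (a simplified formula; exact power calculations for t-tests differ slightly at small sample sizes):

```python
from math import sqrt
from statistics import NormalDist

def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-group comparison of means,
    using the normal approximation to the two-sample test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # critical value for p < alpha
    z_effect = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    return nd.cdf(z_effect - z_crit)

# A small effect (+0.10) needs very large groups to reach 0.80 power
for n in (30, 100, 400, 1600):
    print(f"n per group = {n:4d}: power = {power_two_group(0.10, n):.2f}")
```

Running this shows why small but practically important effects (such as the +0.10 discussed under effect size) demand far more than the 30-per-group rule of thumb to be reliably detected.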
Statistical significance: All studies
using inferential statistics involve
the application of statistical significance -- the probability that a
result or outcome is larger or smaller than would be expected by chance
alone. Standard scientific practice usually assumes a probability value
(or p-value) of less than 1 in 20 (p < 0.05) for statistical significance.
Small values for p indicate that the differences between groups are probably
"true" differences and not due to happenstance. Large values
for p suggest that the observed group difference may well be due to chance
alone, so the study provides no convincing evidence of a real difference between the groups.
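One intuitive way to obtain such a p-value is a permutation test, which asks directly how often chance alone could produce a difference as large as the one observed (a sketch with invented scores; commercial and academic studies typically use t-tests, but the logic is the same):

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Two-sided permutation test: how often does randomly shuffling
    the group labels produce a mean difference at least as large
    as the one actually observed?"""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    combined = group_a + group_b
    n = len(group_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(combined)
        diff = abs(mean(combined[:n]) - mean(combined[n:]))
        if diff >= observed:
            hits += 1
    return hits / trials

# Invented test scores for a treatment and a comparison group
treatment = [82, 88, 75, 90, 78, 85, 80, 87]
comparison = [70, 74, 68, 77, 72, 75, 69, 73]

p = permutation_p_value(treatment, comparison)
print(f"p = {p:.4f}")  # a value below 0.05 would be called statistically significant
```

A small proportion of shuffles reaching the observed difference corresponds to a small p-value: the observed gap is unlikely to be happenstance.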
This site was created by the Center for Technology in Learning at SRI International under a task order from the Planning and Evaluation Service, U.S. Department of Education (DHHS Contract # 282-00-008-Task 3).
Last updated on: 11/04/02