Glossary

Effect size: Researchers have developed a metric called effect size that expresses the magnitude of the relationship between an intervention (e.g., the use of a software package) and an outcome (e.g., student test scores), allowing results to be compared across studies. Effect sizes are calculated for studies in which students from an intervention group and from a matched comparison group have been tested using the same measure of performance. The first step in calculating an effect size is to find the difference between the average score of students in the treatment or intervention group and the average score of students in the comparison group. This difference determines whether the effect size is positive or negative: a positive effect size means the students who received the intervention outscored the students in the comparison group. The second step is to divide the difference in average scores by some measure of the variation in scores within the groups; usually, the standard deviation of the comparison group is used as the denominator. One could say, then, that the effect size is the number of standard deviation units by which a treatment group outperforms a comparison group. Be aware that the design of a study may affect the size of reported effects: poorly designed studies may inflate the estimated effect size.
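The two-step calculation described above can be sketched in a few lines of Python. The score lists below are hypothetical, and the sketch uses the comparison group's standard deviation as the denominator, as this entry describes; this is one of several possible denominators.

```python
# Effect size sketch: treatment mean minus comparison mean, divided by
# the comparison group's standard deviation. Scores are hypothetical.
import statistics

treatment_scores = [72, 75, 78, 80, 83, 85, 88]   # intervention group
comparison_scores = [70, 72, 74, 75, 77, 79, 81]  # matched comparison group

# Step 1: difference between the group averages (sign shows direction).
mean_difference = statistics.mean(treatment_scores) - statistics.mean(comparison_scores)

# Step 2: divide by the comparison group's standard deviation.
effect_size = mean_difference / statistics.stdev(comparison_scores)

print(f"Effect size: {effect_size:+.2f} standard deviation units")
```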

To understand the practical meaning of an effect size, it is important to know how it compares with more familiar metrics or with results from other interventions. For example, an effect size of +1.0 is equivalent to about 15 points on an IQ test or 21 NCEs on the Stanford Achievement Test. Sometimes, effect sizes are reported in terms of "months of learning gain" or "grade equivalents"; an effect size of 0.1 corresponds to approximately 1 month of learning gain. Effect sizes of educational interventions are rarely as large as -1.0 or +1.0. For example, many educators believe reducing class size is an effective way to improve student learning, yet effect sizes for studies of class size reduction fall between +0.13 and +0.18. The practical meaning of the magnitude of an effect should always be interpreted in context (Glass et al., 1981). The effectiveness of a particular intervention should only be assessed in relation to other interventions trying to bring about the same effect (e.g., an improvement in reading achievement or attitudes toward technology use). In addition, the relative cost to produce the effect must be considered. Effect sizes as small as +0.10 may be of important practical significance if (1) the intervention that produced the improvement is relatively inexpensive compared to other competing options; (2) the effect is achieved across all groups of students; and (3) the effect accumulates over time.
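To make the rules of thumb above concrete, here is a small sketch that applies them: about 15 IQ points and 21 NCEs per effect size of 1.0, and about 1 month of learning gain per 0.1. These are rough linear approximations for orientation, not exact psychometric formulas.

```python
# Rough conversions using the rules of thumb quoted above; these are
# approximations for orientation, not exact psychometric formulas.
def describe_effect(es: float) -> str:
    iq_points = es * 15      # +1.0 SD is about 15 IQ points...
    nce_points = es * 21     # ...or about 21 NCEs
    months = es / 0.1        # 0.1 SD is about 1 month of learning gain
    return (f"d = {es:+.2f}: ~{iq_points:+.1f} IQ points, "
            f"~{nce_points:+.1f} NCEs, ~{months:+.0f} months of learning")

for es in (0.10, 0.15, 1.00):   # 0.13 to 0.18 is the class-size range cited above
    print(describe_effect(es))
```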

Differential sample attrition: Sample attrition occurs when members of the sample drop out during the course of a study. Attrition becomes a problem if students with a particular characteristic (e.g., low income) drop out more frequently from one group (treatment or comparison) than from the other. This pattern is referred to as differential sample attrition. As a result, the groups are no longer similar on an important characteristic that other research has found to be related to how well students perform on tests, which makes it difficult to interpret the reason for differences in test scores between the treatment and comparison groups. The fact that the groups are composed of different types of students by the end of the study may explain why one group ends up with higher test scores than the other. If differential attrition goes unnoticed, you may incorrectly conclude that the difference in test scores was caused by the intervention.
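The sketch below, using entirely made-up numbers, shows how differential attrition can manufacture a score difference even when the intervention has no effect at all: low-income students, who score lower on average in this hypothetical data, drop out of the treatment group only.

```python
# Hypothetical illustration: the intervention has NO simulated effect,
# but low-income students (who score lower on average in this made-up
# data) drop out of the treatment group, inflating its final average.
import random, statistics

random.seed(1)

def make_group():
    # (score, is_low_income) pairs; low-income students average 10 points lower
    return ([(random.gauss(60, 10), True) for _ in range(50)] +
            [(random.gauss(70, 10), False) for _ in range(50)])

treatment, comparison = make_group(), make_group()

# Differential attrition: 60% of low-income students leave the treatment group.
treatment_end = [s for s in treatment if not (s[1] and random.random() < 0.6)]

print("Treatment mean: ", statistics.mean(s[0] for s in treatment_end))
print("Comparison mean:", statistics.mean(s[0] for s in comparison))
# The treatment group now "outscores" the comparison group even though
# no intervention effect was simulated.
```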

Grade equivalent score: The grade equivalent (GE) score is one of the most widely used, and most frequently misinterpreted, metrics for reporting student performance. It describes test performance in terms of a grade level and the number of months since the beginning of the school year. For instance, a GE of 4.5 (sometimes reported as 45) is the fifth month of the fourth-grade year. If a student in the third grade scores at the seventh-grade level on a test, it does not mean that the third grader is capable of doing seventh-grade work; it means that the third grader did as well as a seventh grader taking the third-grade test. Grade equivalent scores for individual students should never be averaged to create a school average, and they should not be used to compare student performance in different subject areas. Only NCE scores should be used for these purposes.
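Because the grade.month encoding trips readers up, here is a tiny sketch that unpacks a GE value; the function name and sample value are illustrative.

```python
# Decode a grade equivalent score: the digits before the decimal point
# give the grade level; the digit after gives months into the school year.
def describe_ge(ge: float) -> str:
    grade = int(ge)
    month = round((ge - grade) * 10)
    return f"GE {ge}: grade {grade}, month {month} of the school year"

print(describe_ge(4.5))  # GE 4.5 -> grade 4, month 5 (as in the example above)
```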

Inferential statistics: One use of statistics is to make inferences or judgments about a larger population based on data collected from a smaller sample drawn from that population. Exit polling used during elections to estimate how the full population of voters voted is an example of inferential statistics. A key component of inferential statistics is the calculation of the statistical significance of a research finding.

Interrupted time series design: One limitation of a pretest/posttest design is that it does not take into account that students were on a particular learning trajectory before the treatment. One way researchers may try to improve on this design is to determine pre-intervention trends in performance: they collect information on student or school performance for several semesters or years before the intervention arrives. These pre-intervention trends are then compared with the trends in outcomes following the introduction of the intervention. Any differences observed between the trends may be associated with the effectiveness of the intervention.
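A minimal sketch of the comparison this design makes, assuming yearly school-average scores and fitting an ordinary least-squares slope to each segment; real interrupted time series analyses use more elaborate statistical models.

```python
# Interrupted time series sketch: compare the pre-intervention trend
# (slope) with the post-intervention trend. Scores are hypothetical
# school averages; the intervention begins in year 5.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
            sum((x - mx) ** 2 for x in xs))

years  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [50, 51, 52, 53, 56, 59, 62, 65]

pre  = slope(years[:4], scores[:4])   # trend before the intervention
post = slope(years[4:], scores[4:])   # trend after the intervention
print(f"Pre-intervention slope:  {pre:.2f} points/year")
print(f"Post-intervention slope: {post:.2f} points/year")
```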

Norm-referenced scores: (from http://www.uiowa.edu/~itp/bs-interpret.htm) A norm-referenced score involves comparing a student's score with the scores other students obtained on the same test. How much a student knows is determined by the student's standing or rank within the reference group. High standing is interpreted to mean the student knows a lot or is highly skilled, and low standing means the opposite. Obviously, the overall competence of the norm group affects the interpretation significantly. Ranking high in an unskilled group may represent lower absolute achievement than ranking low in an exceptional, high-performing group.

Normal Curve Equivalent scores (NCE): A normal curve equivalent score is a type of norm-referenced score. It differs from a percentile rank score in that it allows meaningful comparison between different sections of a test. For example, if a student receives NCE scores of 53 on the Reading test and 45 on the Mathematics test, you can correctly say that the Reading score is eight points higher than the Mathematics score.

NCEs are represented on a scale of 1 to 99. This scale coincides with the percentile rank scale at 1, 50, and 99. Unlike percentile rank scores, the intervals between NCE scores are equal, which means that you can average NCE scores to compare groups of students or schools.
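NCE scores place percentile ranks on an equal-interval scale by mapping them through the normal curve, conventionally with a mean of 50 and a standard deviation of about 21.06. The sketch below performs that conversion with scipy's inverse-normal function and shows the two scales coinciding at 1, 50, and 99, as noted above.

```python
# Convert percentile ranks to NCEs: map the percentile through the
# inverse normal curve, then rescale to mean 50, SD ~21.06.
from scipy.stats import norm

def pr_to_nce(pr: float) -> float:
    return 50 + 21.06 * norm.ppf(pr / 100)

for pr in (1, 25, 50, 75, 99):
    print(f"PR {pr:2d} -> NCE {pr_to_nce(pr):5.1f}")
# The two scales agree at 1, 50, and 99.
```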

Percentile rank: A percentile rank (PR) score is a type of norm-referenced score. A PR score indicates the percentage of pupils in the reference or norm group whose scores on a test fell below a particular pupil's raw score. The reference group is usually selected by the publisher of the test to represent the average school in the district, state, or country. A student's PR score will change for different reference groups.

A percentile rank score of 45 means that the student scored better than 45% of the students in the reference group. The intervals between percentile rank scores are not equal: they are much closer together in the middle of the distribution (between 40 and 60) than at the higher and lower ends. Thus it is easier for students who score in the middle of the range to improve in percentile rank than it is for students at the bottom or the top. For this reason, individual percentile rank scores should never be averaged to compute an average percentile rank for a school, nor should they be used to measure change in student performance. In addition, it is inappropriate to use percentile rank scores to compare student performance in different subject areas. Only normal curve equivalents should be used for these purposes.
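The unequal intervals can be illustrated with a short sketch: assuming normally distributed scores, the same gain in standard-deviation units moves a mid-range student many more percentile points than a high-scoring student.

```python
# Unequal percentile intervals: assuming normally distributed scores,
# the same raw gain (here 0.25 SD) produces a much larger percentile
# rank change near the middle than near the top of the distribution.
from scipy.stats import norm

for z in (0.0, 1.5):                 # z = 0 is the 50th percentile
    before = 100 * norm.cdf(z)
    after = 100 * norm.cdf(z + 0.25) # same 0.25 SD improvement
    print(f"PR {before:4.1f} -> {after:4.1f}  (gain: {after - before:.1f} points)")
```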

Population: The group with a particular set of characteristics to which researchers attempt to generalize the findings from a smaller sample. The population is the target of generalization in inferential statistics.

Probability samples: The ideal way to select a representative sample is to set up a selection procedure whereby each student in the population has a known, nonzero chance of being included in the study. Such samples are called probability samples.
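A simple random sample, in which every student has the same known chance of selection, is the most basic probability sample. A sketch with a hypothetical roster:

```python
# Simple random sample: every student on the roster has an equal,
# known chance of being selected (hypothetical roster of 500 students).
import random

roster = [f"student_{i:03d}" for i in range(500)]
sample = random.sample(roster, k=50)  # each student has a 50/500 = 10% chance
print(sample[:5])
```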

Raw score (RS): A raw score is the number of items answered correctly on a test. Raw scores are used to derive the other norm-referenced scores, such as percentile ranks, standard scores, and normal curve equivalents. A raw score by itself has little meaning and cannot be used to compare student performance across different subject areas or tests.

Sample: The sample is the group of people you select to be in your study. The sample should be representative of the larger population to which you want to generalize.

Sample of convenience: Many researchers select students, classrooms, or schools for their samples because they are readily available. This type of sample is called a sample of convenience. Be aware that students or schools that volunteer for a study, or that are more readily available to the researchers, may have characteristics (such as ability, motivation, or attitudes toward technology) that make them a unique group. For this reason, use caution when generalizing research findings based on a sample of convenience to a larger population.

Scale scores: Scale scores are calculated by applying sophisticated statistical procedures directly to the pattern of student responses to the items or questions on a test. Many tests use the scale score as the basic measure of a student's performance. Different tests use different scale ranges; for example, among college admissions tests, the ACT is scored from 1 to 36 and the SAT from 400 to 1600. Scale scores are used primarily as a basis for deriving other normative scores (such as percentile ranks) that describe performance. Unlike percentile rank scores, the intervals between scale scores are equal, which means that you can average scale scores to compare groups of students or schools.

Sampling error: The variation in results across different samples of the same size is called sampling error. All things being equal, increasing the sample size (say, from 25 to 125 students) reduces, but does not eliminate, sampling error, and the study findings can be assumed to be more reliable.
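A quick simulation of this claim: draw many samples of 25 and of 125 from the same hypothetical population and compare how much the sample averages vary.

```python
# Sampling error shrinks (but is not eliminated) as sample size grows:
# draw many samples of each size from the same hypothetical population
# and compare the spread of the sample means.
import random, statistics

random.seed(0)
population = [random.gauss(50, 21) for _ in range(10_000)]  # hypothetical scores

for n in (25, 125):
    means = [statistics.mean(random.sample(population, n)) for _ in range(1_000)]
    print(f"n = {n:3d}: sample means vary with SD {statistics.stdev(means):.2f}")
# Theory predicts the spread shrinks with the square root of n.
```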

Statistical power: Statistical power is the probability that a study will detect a meaningful difference, or effect, if one exists. Ideally, studies should have power levels of 0.80 or higher: an 80% chance or greater of finding an effect if one is really there. The power of any individual study depends on (1) the number of individuals in the study (sample size), (2) the expected size of the improvement in student performance from using the software (effect size), and (3) the precision of the measures used to assess student performance. All else being equal, statistical power increases with larger samples, larger expected effects, and more precise measures. As a rule of thumb, researchers concerned about power prefer sample sizes of at least 30 individuals in both the software-use and non-use groups; larger numbers improve accuracy further.
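Power can be estimated by simulation, as in the sketch below, which assumes normally distributed scores and a two-sample t-test at p < 0.05; the sample sizes and effect size are illustrative. Varying n or the effect size shows how power responds.

```python
# Estimate statistical power by simulation: the fraction of simulated
# studies (normal scores, two-sample t-test at p < 0.05) that detect a
# true effect of the given size.
import numpy as np
from scipy.stats import ttest_ind

def power(n_per_group, effect_size, trials=2_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        if ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / trials

for n in (30, 100, 200):
    print(f"n = {n:3d} per group, d = 0.3: power ~ {power(n, 0.3):.2f}")
```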

Statistical significance: All studies using inferential statistics involve the application of statistical significance testing: determining the probability that a result as large as the one observed would occur by chance alone. Standard scientific practice usually requires a probability value (or p-value) of less than 1 in 20 (p < 0.05) for statistical significance. Small values of p indicate that the differences between groups are probably "true" differences and not due to happenstance. Large values of p suggest that the group differences may well be due to chance and that, in reality, no differences may exist between the groups.
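A minimal sketch of computing a p-value for a group difference, using scipy's two-sample t-test on hypothetical score lists:

```python
# Compute a p-value for the difference between two groups with a
# two-sample t-test (hypothetical score lists).
from scipy.stats import ttest_ind

treatment = [74, 78, 81, 85, 88, 90, 92, 95]
comparison = [70, 72, 75, 77, 79, 82, 84, 86]

result = ttest_ind(treatment, comparison)
print(f"p = {result.pvalue:.3f}")
if result.pvalue < 0.05:
    print("Statistically significant at the conventional p < 0.05 level.")
else:
    print("Difference could plausibly be due to chance.")
```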

It is important to remember that statistical significance does not necessarily imply substantive or practical significance. A research finding may be true without being important: a finding reported as "highly significant" is probably true, but not necessarily highly important.



This site was created by the Center for Technology in Learning at SRI International under a task order from the Planning and Evaluation Service, U.S. Department of Education (DHHS Contract # 282-00-008-Task 3).


