An Educator's Guide to

Red Flags: Problems to Watch for in Studies of Software Effectiveness

There are a number of possible problems with studies of the effectiveness of educational software. Some of these are "red flags" that reduce the overall quality of research and that may lead readers to misjudge the size of a software product's likely effect in a particular setting. Below we discuss some of the most commonly encountered problems to look for when reviewing studies of the effectiveness of educational technology:

Selecting a special group of students, teachers, and sites
In medical research, participants are selected randomly to be part of a trial of a new medication or form of treatment. By selecting participants randomly, medical researchers obtain participants with a variety of backgrounds and personal medical histories that could affect the results. Educational research, by contrast, frequently observes students and teachers in classrooms and schools that have not been selected randomly. These students may have special characteristics at the outset, such as being high-performing. Teachers may have volunteered to use the software, and these teachers may be different from teachers in a comparison group. Schools in the treatment group may have a strong commitment to technology. Selection of particular kinds of students and sites in a research study limits the conclusions one can draw: even if the study is otherwise well-designed, we can know for sure only that the software has been effective with this kind of student or this kind of school.

Using a comparison group that is different from the treatment group
Many studies use a comparison group to help determine whether having students use software to practice reading or math skills is more effective than other methods that might be used in the classroom. In some of these studies, student achievement is measured only once, at the end of the treatment. In others, student achievement is measured at both the beginning and the end of the treatment (a pre-post design). In either design, it is important that the comparison group be similar to the treatment group with respect to the students' backgrounds, prior levels of achievement, and any other factors that could affect achievement. If the comparison group is too different from the treatment group, results may be biased. For example, the treatment group may have higher prior levels of achievement than the comparison group. Alternatively, the initial performance of the treatment group may be much lower, leaving more room for scores to increase than in the comparison group. Some researchers randomly assign students to a treatment or a comparison group, which helps to ensure that the two groups are comparable. Other researchers match groups in other ways, for example by using demographic profiles of schools to find similar groups of students. These alternative methods of matching samples are not as sound because they increase the likelihood that groups will differ in other ways, such as instructional program, class size, or class scheduling.
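One way to check whether two groups were similar at the outset is to compare their scores from before the treatment began. The sketch below is illustrative only: the function name and the sample scores are hypothetical, and it computes a standardized difference in pre-test means (the difference in means divided by the pooled standard deviation), a summary that the studies you review may or may not report.

```python
import statistics


def standardized_difference(treatment_scores, comparison_scores):
    """Difference in group means, expressed in pooled standard deviation units.

    Values near 0 suggest the groups started out at similar levels.
    """
    n_t, n_c = len(treatment_scores), len(comparison_scores)
    if n_t < 2 or n_c < 2:
        raise ValueError("each group needs at least two scores")
    var_t = statistics.variance(treatment_scores)
    var_c = statistics.variance(comparison_scores)
    pooled_var = ((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)
    if pooled_var == 0:
        raise ValueError("scores do not vary, so the difference cannot be standardized")
    mean_gap = statistics.mean(treatment_scores) - statistics.mean(comparison_scores)
    return mean_gap / pooled_var ** 0.5


if __name__ == "__main__":
    # Hypothetical pre-test scores, invented for illustration.
    treatment_pretest = [62, 70, 75, 81, 68, 77]
    comparison_pretest = [55, 61, 66, 72, 59, 64]
    gap = standardized_difference(treatment_pretest, comparison_pretest)
    print(f"Pre-test gap: {gap:.2f} standard deviations")
```

A large gap before the software is ever used is a warning that later differences between the groups may reflect where students started rather than what the software did.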

Selecting an inappropriate test of student achievement
Researchers must select some test to measure a program's effects. On the one hand, it is important that this test be sensitive to the kind of instruction that takes place in the program. For example, if the software addresses reading skills, student scores on a reading test might be expected to improve. At the same time, one could not expect students' scores on a test of mathematics skills to improve in such a program. Sometimes researchers measure technology's effectiveness by using tests that fit the specific technology program goals so narrowly that they do not reflect more common and familiar academic outcomes. Ideally, researchers use tests that have been validated for use across more than one program but that are also sensitive to the kinds of things students might be expected to learn, given the software's design. Other times researchers use tests that subject students to performance conditions that differ from the performance conditions supported by the technology. Such tests may not capture an actual effect because students are performing at a disadvantage (e.g., without a calculator).

Using a small sample
All else being equal, studies with larger numbers of students, classrooms, or schools are of higher quality than studies with significantly fewer participants. When too few subjects are included in a study, the validity of the findings may be questionable. For example, with larger sample sizes a study is more likely to detect meaningful effects of the intervention on students if the effects in fact occurred. Readers should be particularly wary of studies using small sample sizes (fewer than 30 subjects) that report no effect of the use of the technology on student performance. In many such cases, there are too few subjects to detect an effect even when there might have been one. In addition, the smaller the sample size, the higher the probability that the findings are due purely to chance and would not be replicated if the study were repeated with a different set of students, classrooms, or schools. Lastly, with significantly smaller study samples, the findings from a single study are less likely to accurately reflect the typical experience of the typical student or classroom in the population of interest. Thus, in most cases where small sample sizes are used, you may not be able to generalize the findings from a single study to a larger population.
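The point about small samples missing real effects can be illustrated with a simulation. The sketch below is not drawn from any particular study: the function names, the assumed true effect of 0.3 standard deviations, and the rough cutoff of 2.0 for the test statistic are all assumptions chosen for illustration. It repeatedly simulates a study in which the software truly helps and counts how often the study would detect that effect.

```python
import random
import statistics


def detects_effect(n_per_group, true_effect_sd, rng, cutoff=2.0):
    """Simulate one study; return True if the group difference looks significant.

    The cutoff of 2.0 is a rough stand-in for the usual 5% significance threshold.
    """
    comparison = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(true_effect_sd, 1.0) for _ in range(n_per_group)]
    mean_gap = statistics.mean(treatment) - statistics.mean(comparison)
    standard_error = (
        statistics.variance(treatment) / n_per_group
        + statistics.variance(comparison) / n_per_group
    ) ** 0.5
    return abs(mean_gap / standard_error) > cutoff


def estimated_power(n_per_group, true_effect_sd=0.3, n_studies=2000, seed=0):
    """Share of simulated studies that detect an effect that really exists."""
    rng = random.Random(seed)
    hits = sum(
        detects_effect(n_per_group, true_effect_sd, rng) for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    for n in (15, 50, 200):
        print(f"{n} students per group: effect detected in "
              f"{estimated_power(n):.0%} of simulated studies")
```

Under these assumptions, most of the small simulated studies report no significant effect even though one exists, which is why a "no effect" finding from a small study deserves caution.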

Not documenting the duration of students' exposure to the software
All too often, studies of effectiveness do not describe how often or for how long students used the educational software. Not reporting such information makes it difficult to know not just how much exposure is needed to achieve the results the researcher found but also whether the software is effective at all. It may be that students used the software very little and that other reforms led to the results found (see Confounding below). It may also be that students used the software so extensively, and for such a length of time, that it would be impossible for another school to duplicate the results.

Studying effectiveness too early in a software program's use and the effects of novelty
Longer is better. Results are more likely to be applicable if the study was conducted for at least an entire semester of the school year. Studies that test a program for a shorter period may underestimate its impact if teachers and students have not yet learned how to incorporate the software into instruction. In addition, research has found that newly introduced technologies in the classroom may initially inflate scores due to a short-lived period of teacher and student excitement over access to the new program (the "novelty effect"). Program effectiveness measured over the course of an entire semester or school year will provide a more realistic measure of the impact a school can expect to achieve from regular use of the software.

The evaluator is not independent of the vendor
Vendors typically hire evaluators to conduct studies. They may even suggest to evaluators particular research methods or ways of framing results that present the program's effectiveness in a more favorable light. At the same time, evaluators typically follow a code of ethics that requires them to be fair in their assessment of programs and to be willing to report less favorable results. Even so, knowing who sponsored a particular evaluation study and making sure that the data reported match the conclusions drawn by the evaluator are important parts of judging research quality.

Confounding: Separating the effect of software use from the effect of other changes in the school
When interpreting the results from studies that do not include a comparison group of students who did not use the software, you must be aware of other factors at work at the same time within the school and district that could explain the results. Often the introduction of technology programs is accompanied by other changes within schools and districts that may influence student performance, including changes in the curriculum, instructional practice, teacher resources, school organization, and how students are assessed. Without a well-matched comparison group, it is extremely difficult to separate the influence of the technology intervention from the influence of these other factors. For example, scores on the test may be climbing across all schools in the district regardless of any impact from the use of technology in the classroom.
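To illustrate why a comparison group matters here, the following sketch subtracts the gain seen in comparison schools from the gain seen in schools using the software. The function name and all scores are hypothetical, and this simple subtraction assumes that the two groups would otherwise have changed by the same amount.

```python
def gain_beyond_comparison(treat_pre, treat_post, comp_pre, comp_post):
    """Gain in the software group minus the gain in the comparison group."""
    treatment_gain = treat_post - treat_pre
    comparison_gain = comp_post - comp_pre
    return treatment_gain - comparison_gain


if __name__ == "__main__":
    # Hypothetical average test scores, invented for illustration.
    extra_gain = gain_beyond_comparison(
        treat_pre=50, treat_post=58,  # schools using the software gained 8 points
        comp_pre=51, comp_post=57,    # comparison schools gained 6 points anyway
    )
    print(f"Gain beyond the district-wide trend: {extra_gain} points")
```

In this invented example, most of the 8-point gain in the software schools also appears in schools that never used the software, so only the remainder could plausibly be credited to the software.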

Students in treatment groups dropping out of the study at different rates from students in comparison groups
Differential sample attrition is a problem you must be aware of any time a study compares a group of students who used the software to a group who did not. Two groups that are almost identical at the start of the study might become significantly different by the end if students drop out of the groups at different rates or for different reasons. Long-term studies are particularly susceptible to differential sample attrition. Check whether the author reports on the loss of subjects from either group during the course of the study. Information on attrition may also appear in other places, such as in tables used to describe the samples and results. See if you can determine whether the sample sizes for the groups at the beginning and end of the study remain the same or are only slightly different. If the difference is large, significant improvements in test scores attributed to the use of the software may be nothing more than the result of differences in the types of students who remained in the two groups.
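The comparison of starting and ending sample sizes can be worked out with simple arithmetic. The sketch below uses hypothetical function names and invented sample sizes, and it does not assume any particular threshold for how large a gap is too large.

```python
def attrition_rate(n_start, n_end):
    """Fraction of a group's original students who left before the study ended."""
    if n_start <= 0 or not 0 <= n_end <= n_start:
        raise ValueError("need n_start > 0 and 0 <= n_end <= n_start")
    return (n_start - n_end) / n_start


def differential_attrition(treat_start, treat_end, comp_start, comp_end):
    """Absolute gap between the two groups' attrition rates."""
    return abs(
        attrition_rate(treat_start, treat_end) - attrition_rate(comp_start, comp_end)
    )


if __name__ == "__main__":
    # Hypothetical sample sizes, as might be read from a study's tables.
    treat_start, treat_end = 120, 90
    comp_start, comp_end = 118, 112
    print(f"Treatment group attrition:  {attrition_rate(treat_start, treat_end):.0%}")
    print(f"Comparison group attrition: {attrition_rate(comp_start, comp_end):.0%}")
    gap = differential_attrition(treat_start, treat_end, comp_start, comp_end)
    print(f"Difference between groups:  {gap:.0%}")
```

When one group loses a much larger share of its students than the other, as in this invented example, it is worth asking which students left and why before trusting the reported gains.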

This site was created by the Center for Technology in Learning at SRI International under a task order from the Planning and Evaluation Service, U.S. Department of Education (DHHS Contract # 282-00-008-Task 3).

Last updated on: 11/04/02