The Validity and Precision of the Comparative Interrupted Time Series Design and the Difference-in-Difference Design in Educational Evaluation

Marie-Andrée Somers, Pei Zhu, Robin Tepper Jacob, Howard Bloom

In this paper, we examine the validity and precision of two nonexperimental study designs (NXDs) that can be used in educational evaluation: the comparative interrupted time series (CITS) design and the difference-in-difference (DD) design. In a CITS design, program impacts are evaluated by looking at whether the treatment group deviates from its baseline trend by a greater amount than the comparison group does. The DD design is a simplification of the CITS design: it evaluates the impact of a program by looking at whether the treatment group deviates from its baseline mean by a greater amount than the comparison group does. The CITS design is a more rigorous design in theory, because it implicitly controls for differences in baseline means and trends between the treatment and comparison groups. However, the CITS design has more stringent data requirements than the DD design: Scores must be available for at least four time points before the intervention begins in order to estimate the baseline trend, which may not always be feasible.
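The contrast between the two designs can be illustrated with a minimal sketch. It is not the estimation model used in the paper: it assumes a single series of group mean scores per group, a linear baseline trend fitted by ordinary least squares, and invented scores. The function names (`dd_estimate`, `deviation_from_trend`, `cits_estimate`) are hypothetical.

```python
import numpy as np

def dd_estimate(treat_pre, treat_post, comp_pre, comp_post):
    """DD: deviation from the baseline MEAN, treatment minus comparison."""
    treat_dev = np.mean(treat_post) - np.mean(treat_pre)
    comp_dev = np.mean(comp_post) - np.mean(comp_pre)
    return float(treat_dev - comp_dev)

def deviation_from_trend(pre, post):
    """Mean gap between follow-up scores and the projected linear baseline trend."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    t_pre = np.arange(len(pre))                          # baseline time points 0..k-1
    t_post = np.arange(len(pre), len(pre) + len(post))   # follow-up time points
    slope, intercept = np.polyfit(t_pre, pre, 1)         # linear baseline trend (assumption)
    projected = intercept + slope * t_post
    return float(np.mean(post - projected))

def cits_estimate(treat_pre, treat_post, comp_pre, comp_post):
    """CITS: deviation from the baseline TREND, treatment minus comparison."""
    return (deviation_from_trend(treat_pre, treat_post)
            - deviation_from_trend(comp_pre, comp_post))

# Invented scores: four baseline time points and one follow-up point per group.
treat_pre, treat_post = [50, 52, 54, 56], [61]   # baseline slope of 2 points per year
comp_pre, comp_post = [48, 49, 50, 51], [52]     # baseline slope of 1 point per year

print("DD estimate:  ", dd_estimate(treat_pre, treat_post, comp_pre, comp_post))
print("CITS estimate:", cits_estimate(treat_pre, treat_post, comp_pre, comp_post))
```

In this invented example the two groups have different baseline slopes, so the DD estimate (5.5) absorbs part of the pre-existing trend difference, whereas the CITS estimate (about 3) nets it out. When baseline trends are parallel, the two calculations would be expected to agree much more closely.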

This paper examines the properties of these two designs using the example of the federal Reading First program, as implemented in a midwestern state. The true impact of Reading First in this state is known, because program effects can be evaluated using a regression discontinuity (RD) design, which is as rigorous as a randomized experiment under certain conditions. The application of the RD design to evaluate Reading First is a special case of the design, because not only are all conditions for internal validity met, but also impact estimates appear to be generalizable to all schools. Therefore, the RD design can be used to obtain a “causal benchmark” against which to compare the impact findings obtained from the CITS or DD design and to gauge the causal validity of these two designs.

We explore several specific questions related to the CITS and DD designs. First, we examine whether a well-executed CITS design and/or DD design can produce valid inferences about the effectiveness of a school-level intervention such as Reading First, in situations where it is not feasible to choose comparison schools in the same districts as the treatment schools (which is recommended in the matching literature). Second, we explore the trade-off between bias reduction and precision loss across different methods of selecting comparison groups for the CITS/DD designs (for example, one-to-one versus one-to-many matching, and matching with replacement versus without replacement). Third, we examine whether matching the comparison schools on pre-intervention test scores alone is sufficient for producing causally valid impact estimates, or whether bias can be further reduced by also matching on baseline demographic characteristics. And fourth, we examine how the CITS design performs relative to the DD design, with respect to bias and precision. Estimated bias in this paper is defined as the difference between the RD impact estimate and the CITS/DD impact estimates.
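The comparison group selection options named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the matching procedure used in the paper: schools are matched on a single pretest score using absolute distance, the scores are invented, and the function names (`nearest_neighbor_match`, `radius_match`) and the `radius` value are hypothetical.

```python
import numpy as np

def nearest_neighbor_match(treat_scores, comp_scores, replace=True):
    """One-to-one matching: index of the closest comparison school per treatment school."""
    comp_scores = np.asarray(comp_scores, dtype=float)
    if not replace and len(treat_scores) > len(comp_scores):
        raise ValueError("too few comparison schools to match without replacement")
    available = np.ones(len(comp_scores), dtype=bool)
    matches = []
    for score in treat_scores:
        dist = np.abs(comp_scores - score)
        dist[~available] = np.inf        # skip schools that were already used
        j = int(np.argmin(dist))
        matches.append(j)
        if not replace:
            available[j] = False         # each comparison school is used at most once
    return matches

def radius_match(treat_scores, comp_scores, radius):
    """One-to-many matching: all comparison schools within `radius` of each treatment school."""
    comp_scores = np.asarray(comp_scores, dtype=float)
    return [np.flatnonzero(np.abs(comp_scores - s) <= radius).tolist()
            for s in treat_scores]

# Invented pretest scores for two treatment schools and four candidate comparison schools.
treat = [50.0, 51.0]
comp = [50.4, 47.0, 53.0, 52.0]

print(nearest_neighbor_match(treat, comp, replace=True))   # both schools share match 0
print(nearest_neighbor_match(treat, comp, replace=False))  # second school falls back to 3
print(radius_match(treat, comp, radius=1.5))               # variable number of matches
```

With replacement, both treatment schools take the same closest comparison school, which keeps each match as close as possible but yields fewer distinct comparison schools. Without replacement, the second school receives a more distant match. The one-to-many variant can enlarge the comparison group, which is the kind of bias versus precision trade-off the second question refers to.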

Overall, we find no evidence that the CITS and DD designs produce biased estimates of Reading First impacts, even though choosing comparison schools from the same districts as the treatment schools was not possible. We conclude that all comparison group selection methods provide causally valid estimates but that estimates from the radius matching method (described in the paper) are substantially more precise due to the larger sample size it can produce. We find that matching on demographic characteristics (in addition to pretest scores) does not further reduce bias. And finally, we find that both the CITS and DD designs appear to produce causally valid inferences about program impacts. However, because our analyses are based on an especially strong (and possibly atypical) application of the CITS and DD designs, these findings may not be generalizable to other contexts.