Agenda, Scope, and Goals
Extensive literature, resources, and tools are available to help researchers determine power, sample size requirements, or the MDES for a single, unadjusted test, and to design education studies with adequate sample sizes; see, for example, Dong (2013), Spybrook et al. (2011), Raudenbush et al. (2011), Hedges and Rhoads (2010), and Bloom, Richburg-Hayes, and Black (2007). However, MDRC has found no education or impact evaluation literature on estimating power, sample size, or MDES while accounting for multiplicity adjustments. The IES guidelines for multiple testing (Schochet, 2008) state that “statistical power calculations for confirmatory analysis must account for multiplicity,” but do not explain how to do so when multiple testing procedures are used to adjust p-values. This project will fill that gap. For studies that adjust p-values with multiple testing procedures, it will investigate alternatives to standard practice in two areas: how power, sample size, or MDES is estimated, and how power itself is defined.
This project will develop, implement, and test methods for estimating power, sample size requirements, or MDESs while accounting for multiplicity adjustments under any of three multiple testing procedures commonly used in education research: the Bonferroni, Benjamini-Hochberg, and Westfall-Young procedures. To develop these methods, the research team will look to relevant literature in the fields of medicine, genomics, and biostatistics. The team will identify appropriate methods and modify them so that they address issues specific to education research and are feasible to implement for a wide range of applied education researchers. The final product will provide sample computer code and intuitive, step-by-step guides for implementing the recommended methods. The project will also use the proposed methods to illustrate and compare the power and sample size implications of the different multiple testing procedures and of the different power definitions under various realistic scenarios.
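To make the idea of estimating power under a multiplicity adjustment concrete, the sketch below uses a simple Monte Carlo simulation. It is an illustration only, not the method the project will recommend. It rests on several simplifying assumptions that are not part of the proposal: test statistics are independent standard-normal z-statistics, every outcome has a true effect, and the effect is summarized by a hypothetical `noncentrality` value (effect size divided by its standard error). The function names and the three power summaries (per-outcome power, the probability of detecting at least one effect, and the probability of detecting all effects) are likewise illustrative choices. The Westfall-Young procedure is omitted because it requires resampling from study data.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to convert z to p-values


def two_sided_p(z):
    """Two-sided p-value for a z-statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def bonferroni_reject(pvals, alpha):
    """Reject each hypothesis whose raw p-value is at most alpha / M."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: find the largest rank k with
    p_(k) <= k * alpha / M and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


def simulate_power(noncentrality, adjust, alpha=0.05, reps=20000, seed=1):
    """Estimate power by simulation for outcomes that all have true effects.

    noncentrality: assumed mean of each z-statistic (hypothetical input).
    adjust: a function (pvals, alpha) -> list of reject decisions.
    """
    rng = random.Random(seed)
    m = len(noncentrality)
    hits = [0] * m
    at_least_one = 0
    all_outcomes = 0
    for _ in range(reps):
        # Assumption: outcomes are independent; real outcomes are often correlated.
        pvals = [two_sided_p(rng.gauss(d, 1.0)) for d in noncentrality]
        rejected = adjust(pvals, alpha)
        for i, r in enumerate(rejected):
            hits[i] += r
        at_least_one += any(rejected)
        all_outcomes += all(rejected)
    return {
        "individual": [h / reps for h in hits],
        "at_least_one": at_least_one / reps,
        "all": all_outcomes / reps,
    }


if __name__ == "__main__":
    effects = [2.8, 2.8, 2.8]  # hypothetical noncentrality for three outcomes
    for name, rule in [("Bonferroni", bonferroni_reject), ("BH", bh_reject)]:
        print(name, simulate_power(effects, rule))
```

Because the same seed produces the same simulated p-values for both procedures, the two sets of estimates are directly comparable. Under these assumptions the Benjamini-Hochberg rule rejects every hypothesis that Bonferroni rejects, so its estimated power is never lower. The three summaries can also differ substantially from one another, which is why the choice of power definition matters for sample size planning.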