
Our paper explores the possibility of randomly
assigning groups (or "clusters") of individuals
to a program or a control group in order to estimate the impacts
of programs designed to affect whole groups. This cluster
assignment approach maintains the primary strength of random
assignment (the provision of unbiased impact estimates) but
has less statistical power than random assignment of individuals,
which usually is not possible for programs focused on whole
groups. To explore the statistical implications of cluster
assignment we: (1) outline the issues involved, (2) present
an analytic framework for studying these issues, and (3) apply
this framework to assess the potential for using the approach
to evaluate education programs targeted on whole schools.
Our findings suggest that cluster assignment of schools holds
some promise for estimating the impacts of education programs
when it is possible to control for the average performance
of past student cohorts or the past performance of individual
students.

Over the past several decades there has been
considerable experience with randomized experiments to measure
the impacts of social programs (Greenberg and Robins 1986).
At the heart of this approach is the random assignment of
individuals to a program group, which is supposed to receive
program services, or a control group, which is not supposed
to receive these services. Only chance determines who is and
is not selected to receive services, and each eligible applicant
has the same chance of being selected.1
This process creates program and control groups
with no systematic pre-existing differences: the expected
values of all background characteristics, whether measured
or not, are the same for both groups. Although in small samples
these characteristics may differ by chance due to random sampling
error, the margin for error decreases as sample size increases.
Hence, the control group experience provides an unbiased estimate
of what the program group experience would have been without
the program. The difference between these two experiences
therefore represents an internally valid estimate of the program
impact, that is, what it caused to happen. No other methodology
provides the same level of internal validity for program impact
estimates (Hollister and Hill 1995).
However, some programs are designed to
affect whole groups at once, not separate individuals.
For example, school reform initiatives are intended to affect
whole schools, comprehensive community initiatives are designed
to affect whole communities and health education programs
are often targeted on whole geographic areas (Lasoff et al.
1994; Connell et al. 1995; and Murray et al. 1994). For these
programs it usually is not possible to randomly assign individuals
to a program or control group. If a school, a community, or
a geographic area is selected for the program, then everyone
within the group is potentially affected by the program.
However, it might be possible to randomly
assign whole groups or "clusters". For example,
one could randomly select schools for a new education program
from among those eligible to participate. The schools not
chosen would provide a valid control group because their expected
background characteristics would be the same as those for
the schools chosen. However, randomly assigning whole schools
will produce a smaller effective sample than randomly assigning
individual children from these schools. Reducing the effective
sample size, in turn, will reduce the statistical power of
program impact estimates.
Therefore, random assignment of clusters ("cluster
assignment") can produce impact estimates that are internally
valid but may have limited statistical power. As described
below, the extent to which clustering reduces statistical
power depends on the composition of the clusters involved.
This, in turn, depends on how clusters are defined. Cluster
assignment may therefore produce adequate statistical power
for some purposes but not for others.
Our paper examines the statistical power of
cluster assignment empirically. We first outline the statistical
issues involved and present an analytic framework for studying
them. We then use this framework to assess one potential application
of cluster assignment: estimating the impacts of education
initiatives. Our findings suggest that cluster assignment
holds some promise for this application when it is possible
to control for the average performance of past student cohorts
or the past performance of individual students. The paper
concludes by considering the generalizability of our findings
and by exploring several problems that can arise when cluster
assignment is used in practice.
Consider the following situation. A random
assignment study takes place in J sites (schools, communities,
geographic areas, etc.), each containing n sample members
(students, residents, etc.). Thus, the sample for the study
consists of nJ individuals. For simplicity assume that
half the sample is assigned to a program group and half to
a control group, although our argument holds for any program/control
mix. There are three main ways to make this assignment randomly:
- Blocked random assignment
would randomly assign half the individuals from each
site to the program and half to the control group.
- Cluster random assignment would
randomly assign half the sites to the program and
half to the control group.
- Simple random assignment
would randomly assign half of all individuals to
the program and half to the control group, ignoring their
sites.
In each case, the difference between the mean
outcome for the program group and that for the control group
is a valid estimator of the program impact, because its expected
value equals the true impact. The main difference in the statistical
properties of the three approaches is their statistical power.2
These differences depend on the extent to which individuals
vary within sites and the extent to which sites differ, on
average, from each other. Consider the following two extreme
cases.
- All variation is between sites. If
all individuals in a site
have the same outcome, but mean outcomes vary across sites,
then blocked assignment will have no random sampling error.
By ensuring that the program and control groups represent
each site in the same proportion, blocking will ensure that
the groups are identical, regardless of who is selected
from each site. In contrast, random sampling error for cluster
assignment will be at its maximum, because program and control
group differences will depend entirely on which sites are
chosen for each group. Random sampling error for simple
random assignment will lie between that for cluster assignment
and blocked assignment.
- All variation is within sites.
If the mean outcome is the same for all sites,
but outcomes vary across individuals, then cluster
assignment will only reflect sampling error due to who is
selected from each site, and it thus will be equivalent
to simple random assignment. Blocked assignment also will
be equivalent to simple random assignment because there
is no margin for blocking by site to reduce sampling error.
In practice there usually is variation within
and between sites. Hence, blocking by site will reduce random
sampling error and clustering by site will increase it. Therefore,
cluster assignment will have the least statistical power for
any given total sample size, nJ.
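The ordering of the three designs can be illustrated with a small Monte Carlo sketch. It is not part of the original analysis: the function name `simulate_se`, the normal error distributions, and the default values of `J`, `n`, `tau` and `sigma` are all assumptions chosen for illustration. The true impact is set to zero, so the spread of the estimated differences across replications is the random sampling error of each design.

```python
import random
import statistics


def simulate_se(design, J=20, n=30, tau=1.0, sigma=1.0, reps=1000, seed=0):
    """Empirical standard error of the program-control difference of means
    under a true impact of zero (illustrative values, not Rochester data)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Each outcome is a site component plus an individual component.
        sites = []
        for _j in range(J):
            site_effect = rng.gauss(0.0, tau)
            sites.append([site_effect + rng.gauss(0.0, sigma) for _i in range(n)])
        if design == "cluster":
            # Half of the sites go to the program, half to the control group.
            order = list(range(J))
            rng.shuffle(order)
            prog = [y for j in order[: J // 2] for y in sites[j]]
            ctrl = [y for j in order[J // 2:] for y in sites[j]]
        elif design == "blocked":
            # Half of the individuals within every site go to each group.
            prog, ctrl = [], []
            for ys in sites:
                rng.shuffle(ys)
                prog.extend(ys[: n // 2])
                ctrl.extend(ys[n // 2:])
        elif design == "simple":
            # Half of all individuals go to each group, ignoring sites.
            pooled = [y for ys in sites for y in ys]
            rng.shuffle(pooled)
            half = len(pooled) // 2
            prog, ctrl = pooled[:half], pooled[half:]
        else:
            raise ValueError("design must be 'cluster', 'blocked' or 'simple'")
        estimates.append(statistics.fmean(prog) - statistics.fmean(ctrl))
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for design in ("blocked", "simple", "cluster"):
        print(design, round(simulate_se(design), 3))
```

With variation both within and between sites, the simulated sampling error should be smallest for blocked assignment and largest for cluster assignment, which is the pattern described above.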
As noted above, however, for programs that
affect whole groups, it usually is not possible to randomly
assign individual sample members. Hence, neither blocked random
assignment nor simple random assignment is feasible; cluster
assignment is the only option available. Thus it is important
to find a way to assess the statistical power of this approach.
As a first step toward this end, we restate the program and
control group difference of means as the following regression
with between-site and within-site error components.
$$Y_{ij} = \alpha + B_0 P_{ij} + e_j + \epsilon_{ij} \qquad (1)$$
where
$Y_{ij}$ = the outcome for individual $i$ from site $j$,
$\alpha$ = the mean outcome for the control population,
$B_0$ = the true program impact,
$P_{ij}$ = one for individuals subject to the program and zero for others,
$e_j$ = the error component for site $j$, which is independently and identically distributed with mean zero and variance $\tau^2$,
$\epsilon_{ij}$ = the error component for individual $i$ from site $j$, which is independently and identically distributed with mean zero and variance $\sigma^2$.
The coefficient, $B_0$, is
the difference between the mean of outcome $Y$ for persons
subject to the program and the mean of $Y$ for those
not subject to it (the true program impact). The sample-based
estimate of $B_0$ is the program and control
group difference of means, $b_0$. Random
sampling error has two components: $e_j$, to
represent site-specific differences in mean outcomes, and
$\epsilon_{ij}$, to represent individual differences
in outcomes within sites. The variance of the site-specific
error component is represented by $\tau^2$
and the variance of the individual-specific
error component is represented by $\sigma^2$.
The expected value of the impact estimator,
b0, equals the true program impact, B0.
The standard error of the impact estimator for cluster assignment
is (Raudenbush 1997):
$$SE(b_0) = 2\sqrt{\frac{\tau^2}{J} + \frac{\sigma^2}{nJ}} \qquad (2)$$
This standard error has a component due to
between-site sampling error, $\tau^2/J$,
and a component due to within-site sampling error, $\sigma^2/(nJ)$.
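As a minimal sketch, the standard error for cluster assignment can be computed directly, assuming Equation 2 has the form 2 * sqrt(tau^2 / J + sigma^2 / (nJ)) with half the clusters in each group. The function name `cluster_se` and the variance values in the usage lines are hypothetical and are not estimates from this paper.

```python
import math


def cluster_se(tau2, sigma2, n, J):
    """Standard error of the difference of means under cluster assignment.

    tau2   : between-site variance
    sigma2 : within-site variance
    n      : individuals per site
    J      : total number of sites (half program, half control)
    """
    between = tau2 / J          # between-site sampling error component
    within = sigma2 / (n * J)   # within-site sampling error component
    return 2.0 * math.sqrt(between + within)


if __name__ == "__main__":
    # Hypothetical variance components, for illustration only.
    print(cluster_se(tau2=10.0, sigma2=90.0, n=60, J=40))
    # With no between-site variance the formula reduces to the
    # standard error for simple random assignment of nJ individuals.
    print(cluster_se(tau2=0.0, sigma2=100.0, n=60, J=40))
```

Because the between-site component is divided by J alone, adding individuals per site shrinks only the second term, whereas adding sites shrinks both.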
The preceding discussion assumes that a simple
program and control group difference of means is used to estimate
program impacts. However, most randomized experiments use
a regression-adjusted difference of means to reduce the standard
error of impact estimates by controlling statistically for
background characteristics that are correlated with outcomes.
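Equation 3 specifies such a regression with a program variable,
$P_{ij}$, a single background
characteristic or covariate, $X_{ij}$, with coefficient $B_1$, and two
error components, $e_j^*$ and $\epsilon_{ij}^*$.
$$Y_{ij} = \alpha + B_0 P_{ij} + B_1 X_{ij} + e_j^* + \epsilon_{ij}^* \qquad (3)$$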
Note that Xij can be a group
characteristic or an individual characteristic and impact
regressions can include any number or mix of these characteristics.
Of particular value in this regard are measures of past performance
for the same individuals (their "pre-test" scores)
or measures of average performance for previous groups (or
"cohorts") from the same cluster. The true program
impact is still B0 and the new regression-adjusted
impact estimator is b0*. Once again, the
expected value of the impact estimator, E(b0*),
equals the true impact, B0.
Standard errors for regression-adjusted impact
estimators represent generalizations of their counterparts
for difference of means estimators. For example, consider
the following standard errors for cluster assignment with
a regression-adjusted impact estimator using a single group
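characteristic, $X_j$, or a single individual
characteristic, $X_{ij}$, as the covariate (Raudenbush
1997).
For a single group characteristic:
$$SE(b_0^*) = 2\sqrt{\frac{\tau_{*}^{2}}{J} + \frac{\sigma_{*}^{2}}{nJ}}\,\sqrt{1 + \frac{1}{J-4}} \qquad (4)$$
For a single individual characteristic:
$$SE(b_0^*) = 2\sqrt{\frac{\tau_{**}^{2}}{J} + \frac{\sigma_{**}^{2}}{nJ}}\,\sqrt{1 + \frac{1}{nJ-4}} \qquad (5)$$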
There are two major differences between Equations
4 and 5 for regression-adjusted impact estimators and Equation
2 for a simple treatment and control group difference of means.
The first difference involves the inflation factor
($\sqrt{1 + 1/(J-4)}$ in Equation 4 or $\sqrt{1 + 1/(nJ-4)}$ in Equation 5).
This factor rapidly approaches one as $J$ (the number
of clusters) in Equation 4 and $nJ$ (the total number
of individuals) in Equation 5 increase.
For example, with J equal to 10 (five
program clusters and five control clusters) this inflation
factor equals 1.08 in Equation 4. For J equal to 20
the inflation factor equals 1.03. Thus, for all but very small
numbers of clusters, the inflation factor equals roughly one.
Likewise, for nJ equal to 100 (fifty program individuals
and fifty control individuals) the inflation factor is 1.005
in Equation 5 and for nJ equal to 500 it is 1.001.
Thus, even for small numbers of clusters the inflation factor
for an individual covariate has virtually no effect on the
standard error of the impact estimator.
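The inflation factors quoted above can be reproduced with a short calculation. The sketch assumes the factor has the form sqrt(1 + 1/(m - 4)), where m is the number of clusters J for a group covariate or the number of individuals nJ for an individual covariate. That form is inferred here from the reported values, and the function name `inflation_factor` is hypothetical.

```python
import math


def inflation_factor(m):
    """Assumed inflation factor sqrt(1 + 1/(m - 4)) for one covariate.

    m is J (clusters) for a group covariate, or nJ (individuals)
    for an individual covariate.
    """
    if m <= 4:
        raise ValueError("m must exceed 4")
    return math.sqrt(1.0 + 1.0 / (m - 4))


if __name__ == "__main__":
    for J in (10, 20):
        print("J =", J, round(inflation_factor(J), 2))
    for nJ in (100, 500):
        print("nJ =", nJ, round(inflation_factor(nJ), 3))
```

Under this assumed form the calculation gives 1.08 and 1.03 for 10 and 20 clusters, and 1.005 and 1.001 for 100 and 500 individuals, matching the figures in the text.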
A second, more important difference between
the standard errors for difference of means estimators and
their regression-adjusted counterparts under cluster assignment
involves the error component variances ($\tau^2$ and $\sigma^2$,
$\tau_{*}^{2}$ and $\sigma_{*}^{2}$, or
$\tau_{**}^{2}$ and $\sigma_{**}^{2}$). To the extent that a covariate "explains"
some of the variation in the original error components, the
variance of the remaining unexplained error, and thus the
standard error of the program impact estimator, decreases
accordingly.
A group covariate will have all of its effect
on between-cluster error. It cannot explain within-cluster
error because it has no within-cluster variation. Hence, a
group covariate can reduce $\tau^2$ but not $\sigma^2$. An individual covariate can reduce both
between-cluster and within-cluster error variance. However,
it might reduce between-cluster variance by less than
would be possible for a group characteristic. Hence, it is
not clear a priori whether a group characteristic or
an individual characteristic will provide the most effective
covariate for estimating program impacts.
Therefore, when considering cluster assignment
to measure the impacts of a program, the following statistical
issues should be addressed:
- How does the outcome of interest vary
between clusters and within clusters? In other words, what
are the values of $\tau^2$ and $\sigma^2$?
- To what extent can group characteristics
and/or individual characteristics reduce these variance
components and thus reduce the standard error of program
impact estimates?
- Given the likely size of the clusters of
interest and the covariates available for a study, how many
clusters are required to provide enough statistical power
to make the study worthwhile?3
Below we illustrate how to address these
questions for studies of school-wide programs to improve student
performance on standardized tests.
Our basic approach is to explore the statistical
implications for cluster assignment of the variance component
structure of standardized test scores within and between elementary
schools in one medium-size city, Rochester, New York. We do
not actually compute program impact estimates because we do
not examine a specific program. Instead we infer what
the statistical properties of a cluster assignment impact
estimator would be if it were applied to a situation like
that of Rochester elementary schools.
Specifically, we use standardized test scores
for individual students from 25 Rochester elementary schools
to compute the between-school variance, $\tau^2$, and the within-school variance,
$\sigma^2$, separately by: grade (for third grade
and sixth grade), subject (math and reading), year (1989,
1990, 1991, and 1992) and different "impact estimation
models" (covariate specifications).4
Specific grades and subjects are studied separately because
the impacts of educational programs typically are reported
by grade. Findings for different years are reported separately
to examine their stability over time. Findings for different
impact estimation models are reported separately to assess
their ability to increase the statistical power of cluster assignment
designs.
For each combination of grade, subject, year
and model specification we estimate $\tau^2$ and $\sigma^2$
and use these estimates to project the
"minimum detectable effect" and "minimum detectable
effect size" for different cluster assignment samples.
Intuitively, a minimum detectable effect or a minimum detectable
effect size is the smallest effect that a particular research
design has a "good chance" of detecting. The smaller
the minimum detectable effect or minimum detectable effect
size is, the greater the statistical power of the research
design is. A minimum detectable effect is expressed in the
original units of the outcome measure (in our case, scale
scores from a standardized test) whereas a minimum detectable
effect size is expressed as a proportion of the standard deviation
of the outcome measure (in our case, the sample standard deviation
of individual scale scores).5
Using these two related metrics we address the question:
"How many schools are needed to
provide adequate statistical power for a cluster assignment
design intended to measure program impacts on student performance?"
The following 10 impact estimation models
are examined.
Basic Approach (with no covariates)
- Model 1: A program and control group difference of mean test scores.
Cohort Approaches (with group-level covariate/s only)
- Model 2: Controlling for the mean test score of different students who were in the same grade in the previous year ($Y_{j,t-1}$).
- Model 3: Controlling for the mean test score of different students who were in the same grade in the previous year and the year before that ($Y_{j,t-1}$ and $Y_{j,t-2}$).
- Model 4: Controlling for the mean test score of different students who were in the same grade two years earlier ($Y_{j,t-2}$).
- Model 5: Controlling for the mean test score of different students who were in the same grade two and three years earlier ($Y_{j,t-2}$ and $Y_{j,t-3}$).
Longitudinal Approaches (with individual covariate/s only)
- Model 6: Controlling for each individual student's test score in the previous grade ($y_{ij,t-1}$).
- Model 7: Controlling for each individual student's test score two grades earlier ($y_{ij,t-2}$).
- Model 8: Controlling for each student's test score in each of the previous two grades ($y_{ij,t-1}$ and $y_{ij,t-2}$).
Combined Approaches (with individual and group covariates)
- Model 9: Controlling for each individual student's test score in the previous year ($y_{ij,t-1}$) and the mean score of different students in the same grade a year earlier ($Y_{j,t-1}$).
- Model 10: Controlling for each individual student's test score two years prior ($y_{ij,t-2}$) and the mean score of different students in the same grade two years earlier ($Y_{j,t-2}$).
Model 1, a simple treatment and control group
difference of means, serves as our point of departure. The
other models are versions of Equation 3 with different measures
of past student test scores as covariates.
Models 2-5 use cohort data for each
school (cluster) to control for the mean test scores
of its past student cohorts. Hence, these models rely on group
covariates to control for "school effects". Model
2 controls for the average performance of each school's
most recent past cohort (last year's students in
the same grade). Model 3 controls for the average performance
of each school's two most recent cohorts. These models
could be used to estimate the impacts of a program during
its first year of implementation. By comparing their projected
statistical power one can assess the value of obtaining data
on two years versus one year of past school-level performance.
Models 4 and 5 "skip over" the most
recent past cohort (last year's students) and therefore
could be used to estimate the impacts of a program during
its second year of implementation (when last year's students
could have been affected by the program). By comparing the
statistical power of Models 4 and 5 to that for Models 2 and
3, it is possible to project the loss of power that will occur
if one has to wait a year before estimating the impacts of
a program that is slow to start up. This comparison also illustrates
how statistical power will erode as one moves from impact
estimates for the first year of program follow-up to estimates
for the second year.
Models 6, 7 and 8 use individual longitudinal
data to control for the individual
past test scores of present sample members. Hence, they rely
on individual covariates to control for "individual effects".
Model 6 controls for each sample member's
test score during the immediately preceding year and Model
7 controls for his or her score two years earlier.
Model 8 controls for sample members' test scores in both
of the past two years. Hence, Models 6 and 8 can be
used to estimate program impacts during the first year of
program implementation and Model 7 can be used for the second
year, in cases where test scores for a previous year may have
been affected by an ongoing program.
Models 9 and 10 complete our analysis by representing
combinations of both individual and group covariates. Model
9 lags past individual performance and group performance by one year
and hence could be used to estimate impacts in the first
year of program implementation. Model 10 lags past individual
and group performance by two years and hence
could be used to estimate impacts in year two of program implementation.
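One way to keep track of which models suit which year of implementation is to record the covariate lags of each model. The sketch below is illustrative only: the dictionary `MODELS`, its keys and the function `usable_in_year` are hypothetical names. The rule it encodes, that every covariate must be lagged by at least as many years as the program has been running, is an interpretation of the discussion above rather than a statement from the paper.

```python
# Covariate lags (in years) for the ten impact estimation models.
# "group" lags refer to past cohort means, "individual" lags to a
# student's own past scores.
MODELS = {
    1: {"group": (), "individual": ()},
    2: {"group": (1,), "individual": ()},
    3: {"group": (1, 2), "individual": ()},
    4: {"group": (2,), "individual": ()},
    5: {"group": (2, 3), "individual": ()},
    6: {"group": (), "individual": (1,)},
    7: {"group": (), "individual": (2,)},
    8: {"group": (), "individual": (1, 2)},
    9: {"group": (1,), "individual": (1,)},
    10: {"group": (2,), "individual": (2,)},
}


def usable_in_year(model_id, implementation_year):
    """True if no covariate was measured after the program began.

    In implementation year k, scores from the previous k - 1 years
    could already reflect the program, so every lag must be >= k.
    """
    spec = MODELS[model_id]
    lags = spec["group"] + spec["individual"]
    return all(lag >= implementation_year for lag in lags)


if __name__ == "__main__":
    for year in (1, 2):
        usable = [m for m in sorted(MODELS) if usable_in_year(m, year)]
        print("year", year, usable)
```

Under this rule every model is usable in the first year, while only Models 1, 4, 5, 7 and 10 remain usable in the second year, consistent with the description of Models 4, 5, 7 and 10 above.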
As indicated above, we do not actually compute
program impacts with each model. Instead, we use the results
of standardized tests in Rochester to examine the error component
structure of each model and thereby infer what its statistical
properties would be if it were used to estimate program impacts
in a similar environment.
For each impact estimation model, year, subject
and grade, $\tau^2$ and $\sigma^2$
were computed using variance components analysis. For Model
1 the variance components were computed directly from individual
test scores. For Models 2 through 10 they were computed in
three steps: (1) individual scores were regressed on the appropriate
group or individual covariate/s; (2) residuals were computed
for this regression, and (3) the SAS VARCOMP procedure was
used to estimate the variance components of the residuals
(SAS Institute 1989).6
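Step 3 can be sketched with the classical one-way analysis-of-variance (method-of-moments) estimator. This is a stand-in for illustration, not the SAS VARCOMP procedure, whose estimation method and treatment of unbalanced data may differ. It assumes a balanced design (the same number of students in every school), and the function name `variance_components` and the toy input are hypothetical.

```python
import statistics


def variance_components(groups):
    """Method-of-moments estimates of (tau2, sigma2) for balanced data.

    groups: one list of scores (or regression residuals) per school,
            all of the same length.
    """
    J = len(groups)
    n = len(groups[0])
    if J < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 schools of equal size >= 2")
    means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean(means)
    # Mean square between schools and mean square within schools.
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (J - 1)
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, means) for y in g
    ) / (J * (n - 1))
    sigma2 = ms_within
    # E[ms_between] = sigma2 + n * tau2; truncate negative estimates at zero.
    tau2 = max(0.0, (ms_between - ms_within) / n)
    return tau2, sigma2


if __name__ == "__main__":
    # Toy residuals for two schools with two students each.
    print(variance_components([[1.0, 3.0], [5.0, 7.0]]))
```

For Models 2 through 10 the input would be the residuals from step 2 grouped by school; for Model 1 it would be the raw test scores.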
The PEP test is "norm-referenced"
not "criterion-referenced." It does not translate
into grade-equivalents or any other absolute criterion. Hence,
its results only have meaning relative to the distribution
of scores for a reference group; they do not have meaning
in terms of specific identifiable knowledge, ability, or skills.
The reference group we use to interpret PEP
test scores is our analysis sample. Table
1 summarizes the distribution of individual PEP test scale
scores for this sample by grade, subject and year. Several
points are important to note about these distributions. First
note that although mean scores differ by grade and subject
(because they represent different subject matter), they do
not differ much over time (suggesting that different versions
of the same test were similar and that average student performance
did not change much in Rochester during the four-year period
we examined).
Second, note that the standard deviation of
individual scores is about 10 or 11 points for all subjects,
grades and years examined. Hence, as explained later, to convert
minimum detectable effects (in scale scores) to minimum detectable
effect sizes (as a proportion of the standard deviation) we
divide the former by roughly 10 or 11.
Third, note how the distribution of scale
scores translates into a percentile distribution. For each
subject, grade and year, the difference in scale scores between
the 25th and 75th percentiles is 14
to 17 points. In other words, a percentile difference of 50
points reflects a scale score difference of 14 to 17 points.
This implies that each one-point difference in scale scores
represents roughly a three-percentile difference (50 percentile
points spread over 14 to 17 scale score points). This relationship
plays an important role in our interpretation of minimum detectable
effects reported later.
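The conversion from scale score points to percentile points is simple arithmetic on the interquartile figures quoted above. In the sketch below the function name `percentiles_per_point` is hypothetical, and the calculation treats the relationship as linear across the middle of the distribution, which is an approximation.

```python
def percentiles_per_point(scale_points_25_to_75):
    """Approximate percentile points per scale score point,
    based on the distance between the 25th and 75th percentiles."""
    percentile_gap = 75 - 25
    return percentile_gap / scale_points_25_to_75


if __name__ == "__main__":
    for spread in (14, 17):
        print(spread, round(percentiles_per_point(spread), 1))
```

The two ends of the reported range give about 3.6 and 2.9 percentile points per scale score point, hence the working figure of roughly three.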
Two different samples are used for our analysis:
the full sample and a longitudinal sub-sample. The full sample
is used to examine the cohort Models 2-5. It contains between
1,724 and 1,815 third-graders and between 1,229 and 1,475
sixth-graders each year. A sub-sample of students with test
scores for multiple years is used to examine longitudinal
approaches. Specifically, we focus on sixth-graders with test
scores for fourth, fifth and sixth grade and on third graders
with test scores for first, second and third grade. It was
only possible to obtain these longitudinal data for third-graders
or sixth-graders who took PEP tests in 1991 and 1992.7
In the discussion which follows, we first
present findings for the cohort approaches (Models 1-5)
based on data for the full sample. We then present findings
for the longitudinal approaches (Models 6-8) and combined
approaches (Models 9 and 10) based on data for the longitudinal
sub-sample. To facilitate a direct comparison of the cohort
approaches, the longitudinal approaches and the combined approaches
for the same individuals, we also present findings for the
cohort approaches for the longitudinal sub-sample.
By definition, the minimum detectable effect
of a study is the smallest true effect that has a W
percent chance of producing an impact estimate that is statistically
significant at the Z level. For a one-tail hypothesis
test (to assess program-induced improvement, not just change)
at the 0.05 level of statistical significance (Z),
with 80 percent statistical power (W), the minimum
detectable effect is 2.5 times the standard error of an impact
estimator.8
In other words:
$$MDE(b_0) = 2.5 \times SE(b_0) \qquad (6)$$
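The multiplier of 2.5 can be approximated from standard normal quantiles: the critical value for the significance test plus the quantile corresponding to the desired power. This is a sketch under a normal approximation; the paper's own derivation (see its footnote) may rest on a different distribution. The function name `mde_multiplier` is hypothetical.

```python
from statistics import NormalDist


def mde_multiplier(alpha=0.05, power=0.80, one_tailed=True):
    """Multiple of the standard error that gives the minimum
    detectable effect, using a standard normal approximation."""
    z = NormalDist()
    tail_area = alpha if one_tailed else alpha / 2.0
    critical_value = z.inv_cdf(1.0 - tail_area)  # significance threshold
    power_quantile = z.inv_cdf(power)            # margin needed for power
    return critical_value + power_quantile


if __name__ == "__main__":
    print(round(mde_multiplier(), 2))                  # one-tailed test
    print(round(mde_multiplier(one_tailed=False), 2))  # two-tailed test
```

For a one-tailed test at the 0.05 level with 80 percent power, the approximation gives about 2.49, which rounds to the 2.5 used in the text. A two-tailed test would raise the multiplier to about 2.8.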
Substituting Equation
2 into Equation 6 yields:
$$MDE(b_0) = 5\sqrt{\frac{\tau^2}{J} + \frac{\sigma^2}{nJ}} \qquad (7)$$
As can be seen,
the minimum detectable effect for cluster assignment is:
- inversely proportional
to the square root of the total number of clusters, J;
- inversely but
not proportionally related to the number of individuals
per cluster, n; and
- directly but not
proportionally related to the between-cluster and within-cluster
variance components, $\tau^2$ and $\sigma^2$.
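These relationships can be turned around to ask how many clusters a study needs. The sketch below assumes the minimum detectable effect is 2.5 times a standard error of the form 2 * sqrt(tau^2 / J + sigma^2 / (nJ)), and solves for J. The function names and the variance components in the usage lines are hypothetical and are not estimates from this paper.

```python
import math


def mde_cluster(tau2, sigma2, n, J, multiplier=2.5):
    """Minimum detectable effect for cluster assignment of J clusters."""
    return multiplier * 2.0 * math.sqrt(tau2 / J + sigma2 / (n * J))


def clusters_needed(target_mde, tau2, sigma2, n, multiplier=2.5):
    """Smallest even number of clusters whose MDE is at most target_mde."""
    exact = (2.0 * multiplier) ** 2 * (tau2 + sigma2 / n) / target_mde ** 2
    J = math.ceil(exact)
    return J + (J % 2)  # round up to an even number for an equal split


if __name__ == "__main__":
    # Hypothetical variance components, for illustration only.
    tau2, sigma2, n = 16.0, 96.0, 60
    for J in (10, 40):
        print(J, round(mde_cluster(tau2, sigma2, n, J), 2))
    print(clusters_needed(3.0, tau2, sigma2, n))
```

Quadrupling the number of clusters halves the minimum detectable effect, whereas increasing the number of individuals per cluster only reduces the within-cluster term.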
Minimum detectable
effects are a simple way to express statistical power, but
to interpret them requires a basis for judging their policy
relevance. From a benefit-cost perspective, one might ask
whether a proposed study could detect the smallest effect
that would make a program "break even". From a political
perspective, one might ask whether the study could detect
the smallest effect that would be deemed as "having made
an important difference". From a programmatic perspective,
one might ask whether the study could detect an effect that
had a "reasonable chance of being achieved". Which
perspective is applied, and what data are used to inform it
will vary from application to application. But as with any
measure of statistical power, some such determination must
be made in order to interpret it.
Table 2 presents minimum detectable effects for the cluster
assignment of 10, 20, 30, 40 and 60 schools (half to a program
group and half to a control group), given our estimates of
$\tau^2$ and $\sigma^2$ and assuming a grade size of 60 students per
school. Consider the mean findings in column five for Model
1. For 10 schools (600 students), the minimum detectable effect
would be 6.9 points; for 20 schools (1,200 students) it would
be 4.9 points, for 30 schools (1,800 students) it would be
4.0 points, for 40 schools (2,400 students) it would be 3.4
points and for 60 schools (3,600 students) it would be 2.8
points. The situation improves considerably if we control
for the performance of at least one recent cohort. For example,
under Model 2, the minimum detectable effect would be 4.3,
2.9, 2.3, 2.0, or 1.6 points for 10, 20, 30, 40, or 60 schools,
respectively.
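The figures quoted above can be checked against the square-root scaling in the number of schools. The sketch rescales the 10-school value to other sample sizes. For a model with one group covariate it also applies an inflation factor assumed to take the form sqrt(1 + 1/(J - 4)). The function name `rescale_mde` is hypothetical, and because the inputs are rounded to one decimal place the results are approximate.

```python
import math


def rescale_mde(mde_ref, j_ref, j_new, one_group_covariate=False):
    """Rescale a minimum detectable effect from j_ref to j_new clusters,
    holding cluster size and variance components fixed."""

    def inflation(j):
        # Assumed inflation factor for a single group-level covariate.
        return math.sqrt(1.0 + 1.0 / (j - 4)) if one_group_covariate else 1.0

    scale = math.sqrt(j_ref / j_new)
    return mde_ref * scale * inflation(j_new) / inflation(j_ref)


if __name__ == "__main__":
    for J in (20, 30, 40, 60):
        model1 = rescale_mde(6.9, 10, J)
        model2 = rescale_mde(4.3, 10, J, one_group_covariate=True)
        print(J, round(model1, 2), round(model2, 2))
```

Starting from the 10-school values of 6.9 points (Model 1) and 4.3 points (Model 2), the rescaled values track the reported figures closely. The small gap for Model 1 with 40 schools (about 3.45 against a reported 3.4) is within the rounding of the inputs.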
Because there is no absolute benchmark for
interpreting a specific change in PEP test scores, it is difficult
to determine whether the minimum detectable effects for Models
2 through 5 would be adequate for studying a particular educational
initiative. One way to do so, however, is to compare this
change to the distribution of individual test scores in Table
1.
As noted earlier, a 2-point difference in
PEP scores represents a 6-percentile difference in the distribution
of individual scores. Thus, a minimum detectable effect of
2 points (in the range attainable using 40 to 60 schools)
is equivalent to raising the mean performance of program group
members by 6 percentile points. For example, it would be equivalent
to raising their performance from the 50th to the
56th percentile in the original sample distribution.
It seems reasonable to expect a major educational initiative
to produce an improvement of at least this magnitude in order
for it to be deemed successful. Hence, a minimum detectable
effect in this range might be acceptable for an evaluation
of a major educational initiative.
To provide another perspective on the magnitudes
of the findings in Table 2 we transformed them into minimum
detectable effect sizes by dividing each by the average full-sample
standard deviation for the PEP scale scores for the subject
and grade involved. These results are presented in Table
3.9
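The conversion to effect sizes is a single division. The sketch below applies it to the Model 2 minimum detectable effects quoted earlier in the text (4.3, 2.9, 2.3, 2.0 and 1.6 points), using an assumed standard deviation of 10.5 points, the midpoint of the "about 10 or 11" reported for the sample. The actual table uses the standard deviation for each subject and grade, so its entries need not match these illustrative values. The function name `effect_size` is hypothetical.

```python
def effect_size(mde_points, sd_points):
    """Minimum detectable effect expressed in standard deviation units."""
    return mde_points / sd_points


if __name__ == "__main__":
    assumed_sd = 10.5  # assumption: midpoint of "about 10 or 11" points
    model2_mde = {10: 4.3, 20: 2.9, 30: 2.3, 40: 2.0, 60: 1.6}
    for schools, mde in model2_mde.items():
        print(schools, round(effect_size(mde, assumed_sd), 2))
```

With this assumed standard deviation, the effect sizes for 30 to 60 schools fall between roughly 0.15 and 0.22.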
Measuring effect size in units of standard
deviations is a common way to standardize impact estimates
from different studies in order to summarize and compare them.
This approach is especially useful for meta-analyses of treatment
effectiveness studies which report impacts in different units
and for different types of outcomes (Glass, McGaw and Smith
1981).
Effect size is also a common metric for discussions
of statistical power in the behavioral sciences. To provide
guidance for researchers Cohen (1977, 1988) suggests that
effect sizes around 0.20 be considered small, those around
0.50 be considered medium, and those above 0.80 be considered
large. These guidelines have been used by researchers in many
fields for many years.
Lipsey (1990) provides an empirical justification
for Cohen's effect-size standards based on the distribution
of 102 mean effect sizes obtained from 186 meta-analyses of
6,700 studies representing 800,000 sample members. The lower
third of this distribution (small effects) ranged from 0.00
to 0.32; the middle third (medium effects) ranged from 0.33
to 0.55; and the upper third (large effects) ranged from 0.56
to 1.20. The majority of meta-analyses in Lipsey's (1990)
summary represent educational research, and the distribution
of effect sizes is about the same for educational research
and non-educational research (Lipsey 1990, p. 54). Hence,
these findings provide a relevant guide for judging the minimum
detectable effect sizes in Table 3.
As can be seen, if it is possible to control
for the performance of at least one recent cohort from each
school (Models 2-5), the minimum detectable effect
size for cluster assignment of 30 to 60 schools (1,800 to
3,600 students, respectively) is about 0.20. This would be
considered a small effect, both according to Cohen's
(1977, 1988) guidelines and Lipsey's (1990) empirical
findings.
Our analysis of longitudinal impact estimators
focuses on the sub-sample of Rochester sixth-graders and third-graders
with individual test scores for three consecutive years. As
noted above, this information could only be obtained for the
1991 and 1992 samples and it was available for 85 to 90 percent
of these sample members.10
For each sub-sample we computed $\tau^2$ and $\sigma^2$
for each impact estimation model. We then
computed the minimum detectable effects and minimum detectable
effect sizes in Table
4. The first two columns in the table report minimum detectable
effects measures in PEP test scores; the last two columns
present minimum detectable effect sizes computed as a proportion
of the standard deviation of PEP test scores. All estimates
are for a cluster assignment design with a total of 40 schools
and 60 students per school, with half of the schools randomly
assigned to a program and half to a control group. The findings
suggest that:
- Controlling for past average school performance
or past individual student performance markedly reduces
the minimum detectable effect and minimum detectable effect
size for all but third grade math scores (discussed below).
- Controlling for more recent past performance
reduces the minimum detectable effect and minimum detectable
effect size by more than controlling for less recent past
performance.
- Controlling for two years of past performance
reduces the minimum detectable effect and minimum detectable
effect size by slightly more than controlling for one
year of past performance.
- Controlling for past individual performance
reduces the minimum detectable effect and minimum detectable
effect size by slightly more than controlling for past
school performance, for all but third grade math scores
(discussed below).
- In general, controlling for both past
school performance and past individual performance reduces
the minimum detectable effect and minimum detectable effect
size by more than controlling for only one of these alternatives.11
In general, the greatest increase in statistical
power is produced by individual measures of recent past test
performance (Model 6). The minimum detectable effect for this
approach ranged from about 1 to 2 scale score points, which
represents roughly 3 to 6 percentiles (for all but third-grade
math scores, where controlling for past school performance
was far more effective). The corresponding minimum detectable
effect size ranged from 0.10 to 0.20. Hence, by controlling
for recent individual test scores it is possible for cluster
assignment with 40 schools and 60 students per school to detect
relatively small improvements in school performance.
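The link between variance components and minimum detectable effects can be sketched as follows. This is a minimal illustration, not the computation behind Table 4: it assumes a normal approximation, a one-tailed test at the 0.05 level with 80 percent power (the multiplier described in note 8), and that the two variance components are the between-school and within-school variances (called `tau2` and `sigma2` in the code). The function name and the variance values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(tau2, sigma2, n_schools=40, students_per_school=60,
                              share_program=0.5, alpha=0.05, power=0.80):
    """Normal-approximation MDE for a cluster assignment design (one-tailed test)."""
    z = NormalDist()
    # Multiplier: z_0.95 plus the absolute value of z_0.20 (about 2.49).
    multiplier = z.inv_cdf(1 - alpha) + abs(z.inv_cdf(1 - power))
    p = share_program
    # Variance of the program-control difference in mean outcomes.
    variance = (tau2 + sigma2 / students_per_school) / (p * (1 - p) * n_schools)
    return multiplier * sqrt(variance)


# Hypothetical variance components, for illustration only.
tau2, sigma2 = 1.0, 100.0
mde = minimum_detectable_effect(tau2, sigma2)
# Effect size: the MDE as a proportion of the outcome's standard deviation.
effect_size = mde / sqrt(tau2 + sigma2)
```

Under these assumptions, anything that lowers either variance component, such as controlling for past performance, lowers the minimum detectable effect. With many students per school the between-school component carries most of the weight, because the within-school component is divided by the number of students per school.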
The next greatest increase in statistical
power was produced by measures of recent past school performance,
with a minimum detectable effect around 2 scale score points
or 6 percentiles and a minimum detectable effect size ranging
from about 0.15 to 0.20. Findings for the other models also
suggest an ability to detect fairly small program impacts.
The one apparent anomaly in the findings
is that past individual-level performance (Models 6-8) has
very little effect on statistical power for estimating program
impacts on third-grade math scores. This result was obtained
both for 1991 and 1992. One potential explanation for it is
that third-grade math tasks differ fundamentally from those
for first and second grade. Hence, first and second grade
math performance may not provide strong predictors for third-grade
performance.
This paper was motivated by the need for a
rigorous way to estimate the impacts of programs designed
to affect whole groups. For such programs it usually is not
possible to randomly assign individuals to a program or control
group. As an alternative, we explore the possibility of randomly
assigning groups or clusters. Although the statistical theory
of cluster sampling has been known for many years (Cochran
1963), the properties of cluster assignment for specific applications
are not well known. Thus, exploring the feasibility of
this approach requires empirical analysis of its properties
for the applications being considered.
To facilitate such analyses, we tried to do
three things: (1) clarify the statistical issues involved,
(2) provide an analytic framework for studying these issues,
and (3) use the framework to assess one potential application
of cluster assignment:
evaluating educational programs targeted on whole schools.
Our empirical results suggest that cluster
assignment of schools holds some promise when it is possible
to control for either the past performance of individual students
(individual effects) or the average performance of recent
past student cohorts (school effects). These findings are
quite robust. They hold for two different grades (third-grade
and sixth-grade), two different subjects (math and reading)
and four different years (1989, 1990, 1991 and 1992).
On balance, we find that controlling for individual
effects improves statistical power by slightly more than controlling
for school effects. Controlling for more recent past
performance improves statistical power by slightly more than
controlling for less recent past performance. Controlling
for two years of past performance improves statistical power
by slightly more than controlling for one year of past performance.
But most importantly, all of these approaches improve statistical
power substantially.
Consequently, we project that if a good measure
of past individual or school performance is available, it
might be possible to detect a 3 to 6 percentile improvement
in average student performance with cluster assignment of
40 schools and 60 students per school (2,400 students overall).
This implies an effect size of roughly 0.10 to 0.20, which
by most existing standards suggests adequate statistical power.
Nevertheless, our findings represent only
one step towards a better understanding of the strengths and
limitations of cluster assignment. To explore this topic further,
the following questions must be considered.
Do our findings apply to other standardized
tests? We replicated key portions of our analysis
for outcome measures based on individual scores from the California
Achievement Test (CAT) in math administered to Rochester fifth-graders
during 1989, 1990, 1991 and 1992. These findings were consistent
with those reported for third-grade and sixth-grade PEP tests.
Will our findings apply when different
tests are used by schools in different years? Schools
often change the standardized test they use. Hence, the test
used for current students might differ from that used for
past students. To examine the implications of this possibility,
we analyzed current student math performance controlling for
the reading performance of previous cohorts, and vice versa.
Our results were quite similar to those presented above. Hence,
our findings are not sensitive to how past cohort performance
is measured.
Can our findings apply to a study conducted
in more than one city? Our findings reflect the variance
components of test scores for students and schools in one
city. Hence, they indicate what would happen if these schools
were randomly assigned to a program group or control group.
But more than one city might be necessary to recruit enough
eligible schools for an impact study. This could add a between-city
variance component to test scores. However, blocking schools
by city would eliminate this extra variance. For example,
one might recruit eight schools from each of five cities (40
schools) and randomly assign four schools from each city to
the program and four to the control group. Doing so would
remove all city-specific differences between the program group
and control group. Hence, this variance component would not
affect the statistical power of program impact estimates.
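The blocked design described above can be sketched in a few lines. This is only an illustration: the city and school labels, the function name and the fixed seed are hypothetical, and it assumes an even number of schools in each city.

```python
import random


def assign_within_blocks(schools_by_city, seed=0):
    """Randomly assign half of each city's schools to the program group."""
    rng = random.Random(seed)
    assignment = {}
    for city, schools in schools_by_city.items():
        shuffled = list(schools)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half of the shuffled list goes to the program, the rest to control.
        for school in shuffled[:half]:
            assignment[school] = "program"
        for school in shuffled[half:]:
            assignment[school] = "control"
    return assignment


# Hypothetical sample: five cities with eight schools each (40 schools).
schools_by_city = {
    f"city_{c}": [f"city_{c}_school_{s}" for s in range(8)] for c in range(5)
}
assignment = assign_within_blocks(schools_by_city)
```

Because every city contributes four program schools and four control schools, each city is equally represented in both groups, so city-level differences cancel out of the program-control comparison.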
How sensitive is cluster assignment
to contamination of the treatment? When individuals
are randomly assigned to a program, some may not participate
and some of those assigned to the control group may inadvertently
receive program services.12
Consequently, the difference between program services received
by those assigned to the program and those assigned to the
control group is diluted and the measured program impact is
attenuated.
When clusters are randomly assigned, these
problems can be even more severe. If, for example, in a twenty-school
evaluation of a reform, two of the ten "program schools"
fail to implement the reform, and two of the "control
schools" develop a close alternative, one's ability
to identify program impacts can be lessened substantially.
It is not possible to correct such a problem by eliminating
the failed program schools or the control schools that adopt
a similar program, because doing so would compromise the experimental
research design. For those considering cluster assignment
this means that stringent control of the treatment is essential.
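The arithmetic behind the twenty-school example can be illustrated as follows. The sketch rests on two simplifying assumptions that the text does not make: the reform has the same true effect in every school that implements it, and the alternative developed by the two control schools is as effective as the reform. The function name and the true effect value are hypothetical.

```python
def treatment_contrast(program_implementing, n_program, control_adopting, n_control):
    """Share of program clusters treated minus share of control clusters treated."""
    return program_implementing / n_program - control_adopting / n_control


# Twenty-school example: 8 of 10 program schools implement the reform and
# 2 of 10 control schools develop a close alternative.
contrast = treatment_contrast(8, 10, 2, 10)

# Assumption: a constant true effect, and an equally effective alternative.
true_effect = 0.20  # hypothetical effect size
expected_estimate = contrast * true_effect
```

Under these assumptions the treatment contrast falls from 1.0 to 0.6, so the expected impact estimate is only 60 percent of the true effect, while the minimum detectable effect of the design is unchanged.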
How sensitive is cluster assignment
to experimental attrition? Attrition from a study,
or failure to collect follow-up data on some sample members,
is potentially a more serious problem than contamination of
the treatment. Instead of diluting the treatment contrast,
such attrition (also referred to as "experimental mortality")
compromises the internal validity of the experimental design
because program group members and control group members for
whom follow-up data are available may represent non-random
sub-samples of the original program and control groups. Thus,
a decision by one school in a multi-school study to stop providing
data could seriously undermine the quality of a study and
the believability of its results.
This problem could be offset somewhat by grouping
clusters into blocks of two and randomly assigning one cluster
from each block to the program and one to the control group.
In this case, if one cluster dropped out of the study, the
other cluster from its block could be dropped as well. This
would reduce the sample size of the study but would not bias
its impact estimates.
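A minimal sketch of this paired design, and of how a dropout would be handled, is given below. How clusters are grouped into blocks of two is not specified here, so the pairs are taken as given; the cluster labels, function names and fixed seed are hypothetical.

```python
import random


def assign_pairs(pairs, seed=0):
    """Within each block of two clusters, pick one at random for the program."""
    rng = random.Random(seed)
    assignment = []
    for first, second in pairs:
        if rng.random() < 0.5:
            program, control = first, second
        else:
            program, control = second, first
        assignment.append({"program": program, "control": control})
    return assignment


def drop_incomplete_pairs(assignment, dropouts):
    """Keep only the blocks in which both clusters still provide data."""
    dropouts = set(dropouts)
    return [
        block for block in assignment
        if block["program"] not in dropouts and block["control"] not in dropouts
    ]


# Hypothetical blocks of two schools each.
pairs = [("A", "B"), ("C", "D"), ("E", "F")]
assignment = assign_pairs(pairs)
# School C stops providing data, so its whole block (C and D) is removed.
remaining = drop_incomplete_pairs(assignment, dropouts=["C"])
```

The analysis sample then consists of complete blocks only, each of which still contains one program cluster and one control cluster.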
How sensitive is cluster assignment
to outliers? Another source of problems for cluster
assignment is the possibility that unusual circumstances will
produce an aberration in the outcomes for a whole cluster.
In a study based on individual random assignment, a sample
member may win the lottery or go to prison. These rare events
could have a strong effect on the outcome for an individual
but they are unlikely to markedly affect the overall mean
outcome for program group members or control group members.
However, if a philanthropist decides to donate a large sum
of money to a control school in an educational experiment,
this donation might affect the experience of a large proportion
of the sample at once. Even though the standard error of impact
estimates accounts for these random events, they could make
any single impact estimate unbelievable.
Will our findings apply to other school
systems? Our findings represent the variance components
of elementary schools in Rochester, New York between 1989
and 1992. To the extent that the variance components of our
sample schools are similar to those in other school systems,
our findings will apply elsewhere. To the extent that Rochester
is idiosyncratic in this regard, our findings will not apply
elsewhere. The only way to answer this question is to replicate
our analysis for other school systems.
Will our findings for schools apply
to other types of clusters? The variance components
of clusters depend on how clusters are defined and the forces
causing individuals to group together in them. One likely
determinant is the geographic scope of each cluster. For example,
Census blocks (small geographic units) are probably more homogeneous
and differ more on average from each other than do Census
tracts (which comprise numerous Census blocks). Census tracts
are probably more homogeneous and differ more, on average,
from each other than do municipalities (which comprise many
Census tracts). This reflects the way that people concentrate
geographically. Hence, one cannot apply our findings directly
to comprehensive community initiatives, community health education
programs or other programs targeted on different types of
clusters. Nevertheless, it is possible to use our analytic
framework to examine the statistical properties of cluster
assignment for these applications.
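As a starting point for such an examination, the variance components of a new type of cluster can be estimated from outcome data grouped by cluster. The sketch below uses the standard one-way ANOVA (method-of-moments) estimator and assumes equal-sized clusters; it is not necessarily the procedure used for the estimates in this paper, and the function names and data are hypothetical.

```python
from statistics import mean


def variance_components(clusters):
    """ANOVA (method-of-moments) estimates for equal-sized clusters.

    Returns (tau2, sigma2): between-cluster and within-cluster variance.
    """
    n = len(clusters[0])
    if any(len(c) != n for c in clusters):
        raise ValueError("this sketch assumes equal-sized clusters")
    j = len(clusters)
    cluster_means = [mean(c) for c in clusters]
    grand_mean = mean(cluster_means)
    # Mean square between clusters and mean square within clusters.
    ms_between = n * sum((m - grand_mean) ** 2 for m in cluster_means) / (j - 1)
    ms_within = sum(
        (y - m) ** 2 for c, m in zip(clusters, cluster_means) for y in c
    ) / (j * (n - 1))
    # Truncate at zero because the moment estimator can be negative.
    tau2 = max(0.0, (ms_between - ms_within) / n)
    return tau2, ms_within


def intraclass_correlation(tau2, sigma2):
    """Share of total variance that lies between clusters."""
    return tau2 / (tau2 + sigma2)


# Hypothetical outcomes for two clusters of two individuals each.
tau2, sigma2 = variance_components([[1.0, 3.0], [5.0, 7.0]])
rho = intraclass_correlation(tau2, sigma2)
```

Clusters that are internally homogeneous and differ sharply from one another, as suggested above for Census blocks, would show a larger between-cluster share of variance and therefore less statistical power for a given number of clusters.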
1 More complex
designs randomly assign sample members to different treatment
groups or a control group (e.g. Freedman and Friedlander 1995).
2These approaches
have two other differences that are important to note but
lie outside the scope of the present paper. One difference
represents a limitation of cluster assignment; the other represents
an advantage.
The limitation stems from the fact that
cluster assignment cannot produce separate experimental impact
estimates for each site (cluster), whereas blocked assignment
can do so. However, this issue is not germane to the present
discussion about situations where blocked assignment is not
possible.
The advantage of cluster assignment stems
from its ability to capture program "macro-effects".
Macro-effects (Garfinkel et al. 1992 and Harris 1985) represent
changes in the environment at a site caused by a program that
influence outcomes for both control group members and program
group members. Hence, by comparing outcomes for these two
groups, blocked random assignment will miss these effects.
To date, however, there is very little evidence about the
existence of such macro-effects.
3Raudenbush (1997)
provides a framework for jointly determining the optimal number
and size of clusters. Our examples take cluster-size (school-size)
as given, but our approach can be generalized to decisions
about the number and size of clusters.
4There were 34
elementary schools in Rochester, all of which had a third
grade, but only 25 of which had a sixth grade. In order to
maintain the same sample of schools for our third-grade and
sixth grade analyses, we only report findings for the 25 schools
that had both grades. Further analyses, not reported here,
indicate that third-grade findings for all 34 schools are
similar to those for the 25 schools in our sample.
5See Bloom (1995)
for a discussion of minimum detectable effects and Cohen (1977,
1988) for a discussion of effect size.
6SAS PROC MIXED
combines the steps that we used into one procedure (SAS Institute
1997), but we kept them separate to reflect the intuition
of the process.
7Scores for first,
second, fourth and fifth grade were obtained from CAT, CT5
and DRP tests.
8Bloom 1995 illustrates
that the minimum detectable effect in terms of the standard normal
deviate is z_0.95 plus the absolute value of z_0.20
for this case.
9For each subject
and grade we used the mean of the standard deviations for
1989, 1990, 1991 and 1992.
10For sixth-graders,
the full-sample sizes for 1991 math and reading and 1992
math and reading, respectively, are 1,313, 1,314, 1,475 and
1,468; their counterparts for the longitudinal sub-sample
are 1,153, 1,153, 1,363 and 1,365. For third-graders, the
corresponding full-sample sizes are 1,815, 1,806, 1,794 and
1,797 and the corresponding longitudinal sub-sample sizes
are 1,545, 1,545, 1,754 and 1,754.
11The most noticeable
exception to this rule (for 1991 third-grade math scores)
is probably due to sampling error in the variance component
estimates. Close examination of this finding indicates that
although the total individual error variance (τ² + σ²)
was smaller for Model 9 than for Models 2 or 6, the allocation
of this total to each variance component was such that the
corresponding minimum detectable effect for Model 9 was slightly
larger than that for Model 2.
12Bloom 1984
provides a correction for program group members who do not
receive program services, which he refers to as "no-shows"
in experimental research. Haynes and Dantes 1987 provide a
similar correction for this problem in clinical trials, which
they refer to as "non-compliance". Bloom et al.
1997 provide a corresponding correction for both no-shows
and "cross-overs" (control group members who receive
program services).
Bloom, Howard S. (1984) "Accounting
for No-Shows in Experimental Evaluation Designs," Evaluation
Review Vol. 8, No. 2, April, 225-246.
Bloom, Howard S. (1995) "Minimum Detectable
Effects: A Simple Way to Report the Statistical Power of Experimental
Designs." Evaluation Review, Vol. 19, No. 5, pp.
547-556.
Bloom, Howard S., Larry L. Orr, Stephen H.
Bell, George Cave, Fred Doolittle, Winston Lin and Johannes
M. Bos (1997) "The Benefits and Costs of JTPA Title II-A
Programs: Key Findings from the National Job Training Partnership
Study," The Journal of Human Resources, Vol. 32,
No. 3, Summer, pp. 549-576.
Cochran, W. G. (1963) Sampling Techniques
(New York: Wiley).
Cohen, J. (1977) Statistical Power Analysis
for the Behavioral Sciences, rev. ed. (New York: Academic
Press).
Cohen, J. (1988) Statistical Power Analysis
for the Behavioral Sciences, 2nd. ed. (Hillsdale,
NJ: Lawrence Erlbaum).
Connell, James P., Anne C. Kubisch, Lisbeth
B. Schorr and Carol Weiss eds. (1995) New Approaches
to Evaluating Community Initiatives: Concepts, Methods and
Contexts (Washington, DC: The Aspen Institute).
Freedman, Stephen, and Daniel Friedlander
(1995). The JOBS Evaluation: Early Findings on Program
Impacts in Three Sites. (U.S. Department of Health and
Human Services. Administration for Children and Families,
Office of the Assistant Secretary for Planning and Evaluation).
Garfinkel, Irwin, Charles F. Manski and Charles
Michalopoulos (1992) "Micro Experiments and Macro Effects"
in Charles F. Manski and Irwin Garfinkel eds. Evaluating
Welfare and Training Programs (Cambridge, MA: Harvard
University Press).
Glass, G.V., B. McGaw and M.L. Smith (1981)
Meta-analysis in Social Research (Beverly Hills, CA:
Sage Publications).
Greenberg, David and Philip Robins (1986).
"The Changing Role of Social Experiments in Policy Analysis."
Journal of Policy Analysis and Management. 5(2):340-652.
Harris, Jeffrey (1985) "Macroexperiments
versus Microexperiments for Health Policy" in Jerry A.
Hausman and David A. Wise eds. Social Experimentation
(Chicago IL: University of Chicago Press).
Haynes, R. Brian and Renato Dantes (1987)
"Patient Compliance and the Conduct and Interpretation
of Therapeutic Trials" Controlled Clinical Trials
8 (1, March): 12-19.
Hollister, Robinson G. and Jennifer Hill (1995)
"Problems in the Evaluation of Community-Wide Initiatives"
in James Connell, Anne C. Kubisch, Lisbeth B. Schorr and Carol
H. Weiss eds. New Approaches to Evaluating Community Initiatives:
Concepts, Methods, and Contexts (Washington, DC: The Aspen
Institute), pp. 127-172.
Lasoff, Melanie, Lynn Olson, and Meg Sommerfeld
(1994) "School-Reform Networks at a Glance". Education
Week. November 2, 1994.
Lipsey, Mark (1990) Design Sensitivity:
Statistical Power for Experimental Research (Newbury Park,
CA: Sage Publications).
Murray, David M., Peter J. Hannan, David R.
Jacobs, Paul J. McGovern, Linda Schmid, William L. Baker and
Clifton Gray (1994) "Assessing Intervention Effects in
the Minnesota Heart Health Program" American Journal
of Epidemiology Vol. 139, No. 1, pp. 91-103.
Raudenbush, Stephen W. (1997) "Statistical
Analysis and Optimal Design in Cluster Randomized Trials"
Psychological Methods, Vol. 2, No. 2, pp. 173-185.
SAS Institute Inc. (1997) "Chapter 18:
The MIXED Procedure" SAS/STAT Software: Changes and
Enhancements Through Release 6.12 (Cary, NC: SAS Institute),
p. 577.
SAS Institute Inc. (1989) "Chapter 44:
The VARCOMP Procedure," SAS/STAT User's Guide Volume
2 Version 6, (Cary, NC : SAS Institute), pp. 1661-1667.