Researchers are often interested in testing the effectiveness of an intervention on multiple outcomes, for multiple subgroups, at multiple points in time, or across multiple treatment groups. The resulting multiplicity of statistical hypothesis tests can increase the likelihood of spurious findings: that is, statistically significant estimates of effects that do not in fact exist. Without a multiple testing procedure (MTP) to counteract this problem, the probability of at least one false positive finding increases, sometimes dramatically, with the number of tests. Yet using an MTP can substantially reduce statistical power, that is, the probability of detecting effects when they do exist.
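The trade-off described above can be made concrete with a small sketch. It assumes the tests are independent and that every null hypothesis is true when computing the familywise error rate, and it uses the Bonferroni and Holm adjustments purely as familiar examples of an MTP; the text above does not say which procedures are under study. The function names and the example p-values are hypothetical.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across num_tests tests.

    Assumes independent tests and that all null hypotheses are true.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative p-values only; they are not results from any study.
raw_p_values = [0.01, 0.04, 0.03]
print(familywise_error_rate(10))
print(bonferroni_adjust(raw_p_values))
print(holm_adjust(raw_p_values))
```

Under these assumptions, ten tests at the 0.05 level give roughly a 40 percent chance of at least one false positive (1 - 0.95^10 ≈ 0.401). The adjusted p-values show the cost in power: the raw values 0.03 and 0.04 would be significant at 0.05 on their own, but neither is after either adjustment.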
The Subprime Lending Data Exploration Project is a “big data” project designed to produce policy-relevant insights using an administrative data set that covers nearly 50 million individuals who have applied for or used subprime credit. The data set contains information on borrower demographics, loan types and terms, account types and balances, and repayment histories. To investigate whether there were distinct groups of borrowers in terms of loan usage patterns and outcomes, we used a data discovery process called K-means clustering.
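For readers unfamiliar with K-means clustering, the following is a minimal sketch of the standard iterative procedure (Lloyd's algorithm) on toy data. It is not the project's implementation: the number of clusters, the two features, and the data points are all hypothetical, and a real analysis would typically standardize features and compare several values of k.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Group points into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the algorithm has converged.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # Leave an empty cluster's centroid where it is.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical, already-scaled features for six borrowers, for example
# (number of loans, average balance). These values are made up.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = k_means(toy_points, k=2)
print(centroids)
print(assignments)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster is labeled 0 or 1 depends on the random starting centroids.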
Across policy domains, practitioners and researchers are benefiting from greater access to more detailed and more frequent data, along with the increased computing power needed to work with large, longitudinal data sets. There is growing interest in using such data as a case management tool: to better understand patterns of behavior, better manage caseload dynamics, and better target individuals for interventions. In particular, predictive analytics, which has long been used in business and marketing research, is gaining currency as a way for social service providers to identify individuals who are at risk of adverse outcomes. MDRC has multiple predictive analytics efforts under way, which we summarize here while highlighting our methodological approach.