OJPHI: Vol. 5
Journal Information
Journal ID (publisher-id): OJPHI
ISSN: 1947-2579
Publisher: University of Illinois at Chicago Library
Article Information
©2013 the author(s)
open-access: This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.
Electronic publication date: Day: 4 Month: 4 Year: 2013
collection publication date: Year: 2013
Volume: 5
E-location ID: e189
Publisher Id: ojphi-05-189

Collaborative Automation Reliably Remediating Erroneous Conclusion Threats (CARRECT)
Jonathan C. Lansey*1
Paul Picciano1
Ian Yohai1
Fred Grant2
Robert Gern2
1Aptima, Inc., Woburn, MA, USA;
2Northrop Grumman Corporation, Falls Church, VA, USA
*Jonathan C. Lansey, E-mail: jlansey@aptima.com


The objective of the CARRECT software is to make cutting-edge statistical methods for reducing bias in epidemiological studies easy to use and useful for both novice and expert users.


Analyses produced by epidemiologists and public health practitioners are susceptible to bias from a number of sources, including missing data, confounding variables, and statistical model selection. Understanding and applying the multitude of tests, corrections, and selection rules often requires a great deal of expertise, and these tasks can be time-consuming and burdensome. To address this challenge, Aptima began development of CARRECT, the Collaborative Automation Reliably Remediating Erroneous Conclusion Threats system. When complete, CARRECT will provide an expert system that can be embedded in an analyst’s workflow. CARRECT will support statistical bias reduction and improve both analyses and decision making by engaging the user in a collaborative process in which the technology is transparent to the analyst.


Older approaches to imputing missing data, including mean imputation and single imputation regression methods, have steadily given way to a class of methods known as “multiple imputation” (hereafter “MI”; Rubin 1987). Rather than making the restrictive assumption that the data are missing completely at random (MCAR), MI typically assumes the data are missing at random (MAR); that is, the probability that a value is missing may depend on the observed data but not on the unobserved values themselves.
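The difference between the two assumptions can be illustrated with a small simulation. The sketch below uses entirely hypothetical data (the variables `x` and `y`, the 30% missingness rate, and the logistic missingness rule are assumptions chosen for illustration, not taken from CARRECT or IHIS). It shows that simply dropping incomplete records leaves the mean roughly intact under MCAR but shifts it noticeably under MAR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: x is always observed, y depends on x.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# MCAR: every y value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3

# MAR: the chance that y is missing depends only on the observed x
# (here, records with low x are more likely to lack y).
p_missing = 1.0 / (1.0 + np.exp(2.0 * x))
mar_missing = rng.random(n) < p_missing

full_mean = y.mean()                 # mean if nothing were missing
mcar_mean = y[~mcar_missing].mean()  # complete-case mean under MCAR
mar_mean = y[~mar_missing].mean()    # complete-case mean under MAR

print(f"all data:            {full_mean:+.3f}")
print(f"complete cases MCAR: {mcar_mean:+.3f}")
print(f"complete cases MAR:  {mar_mean:+.3f}")
```

Because the MAR mechanism here depends only on the observed `x`, a method that models `y` given `x` can in principle recover the lost information, which is the situation MI is designed for.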

There are two key innovations behind MI. First, the observed values can be useful in predicting the missing cells, so specifying a joint distribution of the data is the first step in implementing the model. Second, single imputation methods will likely fail to produce valid inferences, not only because of the inherent uncertainty in the missing values but also because of the estimation uncertainty associated with the parameters of the imputation procedure itself. By contrast, drawing the missing values multiple times, and thereby generating m complete datasets along with the estimated parameters of the model, properly accounts for both types of uncertainty (Rubin 1987; King et al. 2001). As a result, when its assumptions hold, MI yields valid standard errors and confidence intervals along with unbiased point estimates.
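Once an analysis has been run on each of the m completed datasets, the results are usually pooled with the standard combining rules from the MI literature (Rubin 1987). The following is a minimal sketch of those rules; the function name `pool_rubin` and the example numbers are hypothetical, and we do not claim this is how CARRECT implements the step.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Hypothetical helper illustrating the standard MI combining rules.
    Returns the pooled estimate and its pooled standard error.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need m >= 2 estimates with matching variances")

    q_bar = q.mean()              # pooled point estimate
    within = u.mean()             # average within-imputation variance
    between = q.var(ddof=1)       # variance of estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)

# Made-up results from m = 3 imputed datasets.
pooled_estimate, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(pooled_estimate, pooled_se)
```

The between-imputation term is what a single imputation cannot supply: it carries the uncertainty about the missing values into the final standard error.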

To estimate the joint distribution, CARRECT uses a bootstrapping-based algorithm that gives essentially the same answers as the standard Bayesian Markov chain Monte Carlo (MCMC) or Expectation Maximization (EM) approaches, is usually considerably faster than those approaches, and can handle many more variables.
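The general idea of bootstrap-based imputation can be sketched as follows. This is a deliberate simplification and not the CARRECT algorithm: it handles one incomplete variable, and it replaces the joint-distribution estimate with an ordinary regression fitted to the complete rows of each bootstrap resample. The function name `bootstrap_impute` and the simulated data are hypothetical.

```python
import numpy as np

def bootstrap_impute(x, y, m=5, seed=0):
    """Return m completed copies of y, where NaN marks a missing value.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    n = y.size
    completed = []
    for _ in range(m):
        # 1. Bootstrap: resample rows with replacement, then keep the
        #    resampled rows in which y is observed.
        rows = rng.integers(0, n, size=n)
        rows = rows[~missing[rows]]
        # 2. Estimate the model on the resample. A regression of y on x
        #    stands in here for a full joint-distribution estimate.
        slope, intercept = np.polyfit(x[rows], y[rows], deg=1)
        resid = y[rows] - (intercept + slope * x[rows])
        sigma = resid.std(ddof=2)
        # 3. Draw each missing y from the fitted conditional distribution.
        filled = y.copy()
        noise = rng.normal(scale=sigma, size=int(missing.sum()))
        filled[missing] = intercept + slope * x[missing] + noise
        completed.append(filled)
    return completed

# Simulated example (assumed data, not IHIS): y is missing more often
# when the observed x is low, so missingness is MAR.
rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
y_true = 2.0 * x + rng.normal(size=n)
y_obs = y_true.copy()
y_obs[rng.random(n) < 1.0 / (1.0 + np.exp(2.0 * x))] = np.nan

datasets = bootstrap_impute(x, y_obs, m=5)
imputed_means = [d.mean() for d in datasets]
print("mean of y with no missingness:", y_true.mean())
print("means of the imputed datasets:", imputed_means)
```

Refitting on a fresh resample for each imputation is what propagates the estimation uncertainty of the model parameters; the random draw in step 3 propagates the uncertainty in the missing values themselves.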


Tests were conducted on one of the proposed methods with an epidemiological dataset from the Integrated Health Interview Series (IHIS), producing verifiably unbiased results despite high missingness rates. In addition, mockups (Figure 1) were created of an intuitive data wizard that guides the user through the analysis process by analyzing key features of a given dataset. The mockups also show prompts for the user to provide additional substantive knowledge to improve the handling of imperfect datasets, as well as the selection of the most appropriate algorithms and models.


Our approach and program were designed to make bias mitigation accessible to a much wider audience than the statistical elite alone. We hope that CARRECT will have a wide impact on reducing bias in epidemiological studies and will provide more accurate information to policymakers.


This material is based upon work supported by the Walter Reed Army Institute of Research (WRAIR) under Contract No. W81XWH-11-C-0505. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the WRAIR.

Honaker, James; King, Gary. “What to do About Missing Values in Time Series Cross-Section Data.” American Journal of Political Science 54(2), April 2010: 561–581.
King, Gary; Honaker, James; Joseph, Anne; Scheve, Kenneth. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95(1), March 2001: 49–69.

Figure 1. Screenshot of user selecting imputation parameters.

Article Categories:
  • ISDS 2012 Conference Abstracts

Keywords: Bias reduction, Missing data, Statistical model selection.

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org