Nonparametric Bayesian Multiple Imputation for Missing Data Due to Mid-Study Switching of Measurement Methods

Investigators often change how variables are measured midway through data collection, for example in hopes of obtaining greater accuracy or reducing costs. The resulting data comprise sets of observations measured on two (or more) different scales, which complicates interpretation and can bias analyses that rely directly on the differentially measured variables. We develop approaches based on multiple imputation for handling mid-study changes in measurement in settings without calibration data, that is, settings in which no subjects are measured on both (all) scales. This setting creates a seemingly insurmountable problem for multiple imputation: since the measurements never appear jointly, the data contain no information about their association. We resolve the problem by making an assumption that is often scientifically reasonable: each measurement regime accurately ranks the samples, but on a different scale, so that, for example, an individual at the qth percentile on one scale should be at about the qth percentile on the other scale. We use rank-preservation assumptions to develop three imputation strategies that flexibly transform measurements made on one scale to measurements made on another: a Markov chain Monte Carlo (MCMC)-free approach based on permuting ranks of measurements, and two approaches based on dependent Dirichlet process (DDP) mixture models for imputing values conditional on covariates. We use simulations to illustrate the conditions under which each strategy performs well, and we present guidance on when to apply each. We apply these methods to a study of birth outcomes in which investigators collected mothers’ blood samples to measure levels of environmental contaminants. Midway through data ascertainment, the study switched from one analytical lab to another. The distributions of blood lead levels differ greatly across the two labs, suggesting that the labs report measurements on different scales. We use nonparametric Bayesian imputation models to obtain sets of plausible measurements on a common scale, and we estimate quantile regressions of birth weight on various environmental contaminants.
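
The rank-preservation assumption can be illustrated with a small sketch. The code below is not the authors' imputation procedure (which draws multiple imputations, either by permuting ranks or from DDP mixture models); it is a deterministic, single-pass quantile mapping that only shows the core idea: a value at the qth percentile of one lab's measurements is assigned the value at the qth percentile of the other lab's measurements. The function name `rank_preserving_map`, the simulated lognormal data, and the sample sizes are all assumptions made for illustration.

```python
import numpy as np

def rank_preserving_map(source, target):
    """Map each source value to the target value at the same empirical percentile.

    Illustrative helper (hypothetical name), not the method from the paper.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # 0-based rank of each source value within its own sample
    # (ties, if any, are broken by position).
    ranks = source.argsort().argsort()
    # Convert ranks to percentiles strictly inside (0, 1).
    q = (ranks + 0.5) / source.size
    # Read off the target-sample value at each of those percentiles.
    return np.quantile(target, q)

# Simulated measurements on two hypothetical scales (no subject appears in both).
rng = np.random.default_rng(0)
lab_a = rng.lognormal(mean=0.0, sigma=0.5, size=200)
lab_b = 3.0 * rng.lognormal(mean=0.0, sigma=0.5, size=150) + 1.0

# Lab B measurements re-expressed on lab A's scale.
lab_b_on_a = rank_preserving_map(lab_b, lab_a)
```

The mapped values keep the ordering of the original lab B measurements and lie within the observed range of lab A. A single deterministic mapping like this ignores covariates and understates uncertainty, which is part of what the multiple-imputation strategies described above are meant to address.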
