Using a mixture model for multiple imputation in the presence of outliers: the ‘Healthy for life’ project

We consider the problem of obtaining population-based inference in the presence of missing data and outliers, in the context of estimating the prevalence of obesity and body mass index measures from the 'Healthy for life' study. Identifying multiple outliers in a multivariate setting is difficult because of phenomena such as masking, in which groups of outliers inflate the covariance matrix so that they cannot be identified when included, and swamping, in which outliers skew the covariances so that non-outlying observations appear to be outliers. We develop a latent class model that assumes that each observation belongs to one of K unobserved latent classes, each with a distinct covariance matrix. We treat the latent class whose covariance matrix has the largest determinant as an 'outlier class'. By separating the covariance matrix for the outliers from the covariance matrices for the remainder of the data, we avoid the problems of masking and swamping. Like Ghosh-Dastidar and Schafer, we use a multiple-imputation approach, which allows us simultaneously to conduct inference after removing cases that appear to be outliers and to propagate uncertainty in the outlier status through the model inference. We extend the work of Ghosh-Dastidar and Schafer by embedding the outlier class in a larger mixture model, by considering penalized likelihood and posterior predictive distributions to assess model choice and model fit, and by developing the model to account for the complex sample design. We also consider the repeated sampling properties of the multiple-imputation removal of outliers. Copyright 2007 Royal Statistical Society.
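
The outlier-class idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the mixture parameters (class weights, means and covariance matrices) are already available, for example as a single posterior draw, and it ignores missing data and the sample design. The function names `outlier_class`, `class_posteriors` and `draw_outlier_flags`, and all numerical values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def outlier_class(covs):
    """Index of the class whose covariance matrix has the largest determinant."""
    logdets = [np.linalg.slogdet(c)[1] for c in covs]
    return int(np.argmax(logdets))

def class_posteriors(X, weights, means, covs):
    """Class-membership probabilities for each row of X under a Gaussian mixture."""
    n, p = X.shape
    log_post = np.empty((n, len(weights)))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        logdet = np.linalg.slogdet(cov)[1]
        log_post[:, k] = np.log(w) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def draw_outlier_flags(post, k_out, n_imputations, rng):
    """Draw class labels n_imputations times; True where a draw is the outlier class."""
    cum = post.cumsum(axis=1)
    u = rng.random((n_imputations, post.shape[0], 1))
    labels = (u > cum[None, :, :]).sum(axis=2)
    labels = np.minimum(labels, post.shape[1] - 1)  # guard against rounding in cum
    return labels == k_out

# Illustrative (made-up) two-class mixture: a tight main class and a diffuse class.
rng = np.random.default_rng(0)
weights = [0.95, 0.05]
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), 25.0 * np.eye(2)]
X = np.array([[0.1, -0.2], [8.0, 9.0]])

k_out = outlier_class(covs)
post = class_posteriors(X, weights, means, covs)
flags = draw_outlier_flags(post, k_out, n_imputations=5, rng=rng)
```

Each row of `flags` corresponds to one imputed data set: cases flagged `True` would be dropped before the analysis of that data set, and combining the analyses across rows carries the uncertainty about outlier status into the final inference.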

[1]  T. Cole The LMS method for constructing normalized growth standards. , 1990, European journal of clinical nutrition.

[2]  A. Brix Bayesian Data Analysis, 2nd edn , 2005 .

[3]  H. Akaike A Bayesian analysis of the minimum AIC procedure , 1978 .

[4]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm (with discussion) , 1977 .

[5]  Katherine M Flegal,et al.  Prevalence and trends in overweight among US children and adolescents, 1999-2000. , 2002, JAMA.

[6]  B. Carlin,et al.  Bayesian Model Choice Via Markov Chain Monte Carlo Methods , 1995 .

[7]  K I Penny,et al.  Multivariate outlier detection applied to multiply imputed laboratory data. , 1999, Statistics in medicine.

[8]  J. Schafer,et al.  Multiple Edit/Multiple Imputation for Multivariate Continuous Data , 2003 .

[9]  Jeffrey B. Schwimmer,et al.  Preventing Childhood Obesity: Health in the Balance , 2005, Environmental Health Perspectives.

[10]  N. Stettler,et al.  High Prevalence of Overweight Among Pediatric Users of Community Health Centers , 2005, Pediatrics.

[11]  J. Shults,et al.  Long-term, high-dose glucocorticoids and bone mineral content in childhood glucocorticoid-sensitive nephrotic syndrome. , 2004, The New England journal of medicine.

[12]  Xiao-Li Meng,et al.  Multiple-Imputation Inferences with Uncongenial Sources of Input , 1994 .

[13]  T J Cole,et al.  Growth charts for both cross-sectional and longitudinal data. , 1994, Statistics in medicine.

[14]  D. Rubin Multiple imputation for nonresponse in surveys , 1989 .

[15]  T. J. Cole,et al.  Establishing a standard definition for child overweight and obesity worldwide: international survey. , 2000, BMJ.

[16]  M. Stephens Dealing with label switching in mixture models , 2000 .

[17]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[18]  H. Teicher Identifiability of Finite Mixtures , 1963 .

[19]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[20]  R. Little Robust Estimation of the Mean and Covariance Matrix from Data with Missing Values , 1988 .

[21]  Joseph L Schafer,et al.  Analysis of Incomplete Multivariate Data , 1997 .

[22]  B. Graubard,et al.  Latent Class Analysis of Complex Sample Survey Data , 2002 .

[23]  Blossom H. Patterson,et al.  Latent Class Analysis of Complex Sample Survey Data , 2002 .

[24]  Kim-Hung Li,et al.  Imputation using Markov chains , 1988 .

[25]  P. Green Reversible jump Markov chain Monte Carlo computation and Bayesian model determination , 1995 .

[26]  D. Cox,et al.  An Analysis of Transformations , 1964 .

[27]  R. Woodruff A Simple Method for Approximating the Variance of a Complicated Estimate , 1971 .

[28]  D. G. Simpson,et al.  Unmasking Multivariate Outliers and Leverage Points: Comment , 1990 .

[29]  R. Little,et al.  Editing and Imputation for Quantitative Survey Data , 1987 .

[30]  K. Flegal,et al.  Prevalence of overweight and obesity among US children, adolescents, and adults, 1999-2002. , 2004, JAMA.

[31]  Shumei S. Guo,et al.  2000 CDC Growth Charts for the United States: methods and development. , 2002, Vital and health statistics. Series 11, Data from the National Health Survey.

[32]  K. Flegal,et al.  Prevalence and Trends in Overweight among Us , 2022 .

[33]  S. Y. Kimm,et al.  Childhood obesity: a new pandemic of the new millennium. , 2002, Pediatrics.

[34]  A F Roche,et al.  CDC growth charts: United States. , 2000, Advance data.

[35]  N. Campbell Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation , 1980 .

[36]  Sonia Caprio,et al.  Obesity and the metabolic syndrome in children and adolescents. , 2004, The New England journal of medicine.

[37]  K. Chaloner,et al.  A Bayesian approach to outlier detection and residual analysis , 1988 .

[38]  P. Rousseeuw,et al.  Unmasking Multivariate Outliers and Leverage Points , 1990 .

[39]  M. Elliott Multiple Imputation in the Presence of Outliers , 2006 .

[40]  Roger A. Sugden,et al.  Multiple Imputation for Nonresponse in Surveys , 1988 .

[41]  Xiao-Li Meng,et al.  POSTERIOR PREDICTIVE ASSESSMENT OF MODEL FITNESS VIA REALIZED DISCREPANCIES , 1996 .

[42]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[43]  A. Hadi Identifying Multiple Outliers in Multivariate Data , 1992 .

[44]  M. J. Bayarri,et al.  Bayesian measures of surprise for outlier detection , 2003 .