Using Principal Components as Auxiliary Variables in Missing Data Estimation

To deal with missing data that arise from participant nonresponse or attrition, methodologists have recommended an “inclusive” strategy in which a large set of auxiliary variables is used to inform the missing data process. In practice, the set of possible auxiliary variables is often too large to use in full. We propose using principal components analysis (PCA) to reduce the possible auxiliary variables to a manageable number. A series of Monte Carlo simulations compared the inclusive strategy with eight auxiliary variables (inclusive approach) to the PCA strategy using just one principal component derived from the eight original variables (PCA approach). We examined the influence of four independent variables (magnitude of correlations, rate of missing data, missing data mechanism, and sample size) on parameter bias, root mean squared error, and confidence interval coverage. Results indicate that the PCA approach yields unbiased parameter estimates and may be more accurate than the inclusive approach. We conclude that using the PCA strategy to reduce the number of auxiliary variables is an effective and practical way to reap the benefits of the inclusive strategy when many possible auxiliary variables are available.
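The reduction step described above, collapsing eight auxiliary variables into a single principal component, can be sketched as follows. This is a minimal illustration and not the authors' simulation code: the function name `first_principal_component`, the simulated data, and the choice to standardize the variables (a correlation-matrix PCA) are all assumptions, and the sketch further assumes the auxiliary variables themselves are completely observed.

```python
import numpy as np

def first_principal_component(aux):
    """Return scores on the first principal component of the columns of
    `aux`, plus the share of total variance that component explains.

    Hypothetical helper for illustration; assumes no missing values in `aux`.
    """
    aux = np.asarray(aux, dtype=float)
    # Standardize each column so the PCA is based on the correlation matrix.
    z = (aux - aux.mean(axis=0)) / aux.std(axis=0, ddof=1)
    # SVD of the standardized data; rows of vt are the component loadings.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[0]                      # one score per case
    explained = s[0] ** 2 / np.sum(s ** 2)  # proportion of variance for PC1
    return scores, explained

# Simulated example (assumed values): eight auxiliary variables that share
# one common source of variance, for 200 cases.
rng = np.random.default_rng(0)
n_cases, n_aux = 200, 8
common = rng.normal(size=(n_cases, 1))
aux = 0.7 * common + 0.7 * rng.normal(size=(n_cases, n_aux))

pc1, share = first_principal_component(aux)
print(pc1.shape, round(float(share), 2))
```

The resulting `pc1` column would then stand in for the eight original variables as the single auxiliary variable in whatever missing data procedure is used, such as an imputation model. The sign of a principal component is arbitrary, so only its magnitude of association with other variables is meaningful.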
