The E-MS Algorithm: Model Selection With Incomplete Data

We propose a procedure, based on the idea of the E-M algorithm, for model selection in the presence of missing data. The idea is to extend the concept of parameters to include both the model and the parameters under the model, so that the model itself becomes part of the E-M iterations. We develop the procedure, called the E-MS algorithm, under the assumption that the class of candidate models is finite. Special cases considered include E-MS with the generalized information criterion (GIC) and E-MS with the adaptive fence (AF; Jiang et al.). For both, we prove numerical convergence of the E-MS algorithm and consistency in model selection of the limiting model. We also study how different missing-data mechanisms affect model selection. Furthermore, we carry out extensive simulation studies of the finite-sample performance of E-MS, with comparisons to other procedures. The methodology is illustrated with a real-data analysis involving QTL mapping in an agricultural study on barley grains. Supplementary materials for this article are available online.
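To make the idea of iterating over the pair (model, parameters) concrete, the following is a minimal sketch and not the authors' implementation. It assumes a normal linear regression with fully observed covariates, responses missing completely at random, candidate models given by subsets of the covariates, and a BIC-type penalty as one possible GIC. The names `solve`, `ols`, `e_ms`, and `simulate` are hypothetical and introduced only for illustration.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y, model):
    """Least-squares fit of y on an intercept plus the covariates in `model`."""
    design = [[1.0] + [r[j] for j in model] for r in rows]
    p = len(design[0])
    xtx = [[sum(d[j] * d[k] for d in design) for k in range(p)] for j in range(p)]
    xty = [sum(d[j] * yi for d, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bk * dk for bk, dk in zip(beta, d)) for d in design]
    return beta, fitted


def e_ms(rows, y, candidates, max_iter=200, tol=1e-8):
    """y[i] is None when missing. Returns (model, beta, sigma2, iterations)."""
    n = len(y)
    obs = [i for i in range(n) if y[i] is not None]
    mis = [i for i in range(n) if y[i] is None]
    penalty = math.log(n)  # assumption: BIC-type penalty as the GIC

    # Initialise from a complete-case fit of the largest candidate model.
    model = max(candidates, key=len)
    beta, fitted = ols([rows[i] for i in obs], [y[i] for i in obs], model)
    sigma2 = sum((y[i] - f) ** 2 for i, f in zip(obs, fitted)) / len(obs)

    for it in range(1, max_iter + 1):
        # E-step: conditional mean of each missing response under the
        # current (model, parameters); its variance adds to the expected RSS.
        mu = {i: beta[0] + sum(b * rows[i][j] for b, j in zip(beta[1:], model))
              for i in mis}
        y_fill = [y[i] if y[i] is not None else mu[i] for i in range(n)]
        extra = len(mis) * sigma2

        # MS-step: maximise the expected log-likelihood within each candidate,
        # then keep the candidate with the smallest penalised criterion.
        best = None
        for cand in candidates:
            b, fit = ols(rows, y_fill, cand)
            s2 = (sum((yi - fi) ** 2 for yi, fi in zip(y_fill, fit)) + extra) / n
            gic = n * math.log(s2) + penalty * (len(cand) + 2)
            if best is None or gic < best[0]:
                best = (gic, cand, b, s2)
        _, new_model, new_beta, new_sigma2 = best

        done = new_model == model and max(
            abs(u - v) for u, v in zip(new_beta + [new_sigma2], beta + [sigma2])
        ) < tol
        model, beta, sigma2 = new_model, new_beta, new_sigma2
        if done:
            break
    return model, beta, sigma2, it


def simulate(n=200, seed=0, p_missing=0.3):
    """Toy data: y depends on covariates 0 and 1 only; covariate 2 is noise."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [1.0 + 2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 1) for r in rows]
    y = [None if rng.random() < p_missing else v for v in y]
    return rows, y


rows, y = simulate()
candidates = [c for k in range(4) for c in itertools.combinations(range(3), k)]
model, beta, sigma2, iters = e_ms(rows, y, candidates)
print("selected covariates:", model, "iterations:", iters)
```

The setting is deliberately simple so that both steps have closed forms. With other missing-data patterns or with the adaptive fence in place of the penalised criterion, the E-step and the selection rule would differ from this sketch.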
