Variable selection when confronted with missing data

Variable selection is a common problem in linear regression. Stepwise methods, such as forward selection, are popular and readily available in most statistical packages. The models selected by these methods have a number of drawbacks: they are often unstable, with the set of selected variables changing in response to small changes in the data, and they provide upwardly biased regression coefficient estimates. Recently proposed methods, such as the lasso, provide accurate predictions via a parsimonious, interpretable model. Missing data values are also a common problem, especially in longitudinal studies. One approach to account for missing data is multiple imputation. Simulation studies were conducted to compare the lasso with standard variable selection methods under different missing data conditions, including the percentage of missing values and the missing data mechanism. Under missing at random mechanisms, missing data were created at the 25 and 50 percent levels with two types of regression parameter vectors, one containing large effects and one containing several small, but nonzero, effects. Five correlation structures were used in generating the data: independent, autoregressive with correlation 0.25 and 0.50, and equicorrelated, again with correlation 0.25 and 0.50. Three different missing data mechanisms were used to create the missing data: linear, convex and sinister. Least angle regression performed well under all conditions when the true regression parameter vector contained large effects, with its dominance increasing as the correlation between the predictor variables increased. This is consistent with complete-data simulation studies suggesting that the lasso performed poorly in situations where the true beta vector contained small, nonzero effects. When the true beta vector contained small, nonzero effects, the performance of the variable selection methods considered was situation dependent.
Ordinary least squares had superior performance in terms of confidence interval coverage under the independent correlation structure and, with correlated data, when the true regression parameter vector consisted of small, nonzero effects. When the regression parameter vector consisted of large effects and the predictor variables were correlated, a variety of methods performed well, depending on the missing data situation.
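
The data-generating side of such a simulation can be illustrated with a minimal sketch. It is not the code used in the study: the function names, the number of predictors, the sample size, and in particular the missingness rule are assumptions made for illustration. The sketch draws standard normal predictors under the three families of correlation structure named above (independent, autoregressive, equicorrelated) and then deletes values in one predictor with a probability that depends only on another, fully observed predictor, which is one simple way to obtain a missing at random pattern at roughly a chosen rate. It does not attempt to reproduce the linear, convex or sinister mechanisms.

```python
import math
import random


def draw_row(p, structure, rho, rng):
    """Draw one row of p standard normal predictors with the given correlation structure."""
    e = [rng.gauss(0.0, 1.0) for _ in range(p)]
    if structure == "independent":
        return e
    if structure == "autoregressive":
        # AR(1) recursion: corr(x_i, x_j) = rho ** |i - j|, unit variances.
        row = [e[0]]
        for j in range(1, p):
            row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * e[j])
        return row
    if structure == "equicorrelated":
        # A shared factor z gives every pair of predictors correlation rho.
        z = rng.gauss(0.0, 1.0)
        return [math.sqrt(rho) * z + math.sqrt(1.0 - rho) * ej for ej in e]
    raise ValueError("unknown structure: " + structure)


def impose_mar(rows, target, driver, rate, rng):
    """Set column `target` to None with probability 2 * rate * Phi(x_driver).

    ASSUMPTION: this monotone rule is illustrative only. Because the driver
    column is standard normal and never deleted, the expected missing
    fraction is `rate` (valid for rate <= 0.5) and the pattern is MAR.
    """
    out = []
    for row in rows:
        phi = 0.5 * (1.0 + math.erf(row[driver] / math.sqrt(2.0)))
        new_row = list(row)
        if rng.random() < 2.0 * rate * phi:
            new_row[target] = None
        out.append(new_row)
    return out


# Hypothetical settings: 8 predictors, 200 rows, equicorrelated at 0.50,
# about 25 percent missing in the first predictor.
rng = random.Random(2024)
complete = [draw_row(8, "equicorrelated", 0.50, rng) for _ in range(200)]
incomplete = impose_mar(complete, target=0, driver=1, rate=0.25, rng=rng)
missing_fraction = sum(r[0] is None for r in incomplete) / len(incomplete)
print("fraction missing in predictor 0:", missing_fraction)
```

In a full study along these lines, each incomplete data set would then be multiply imputed and each variable selection method applied to the imputed data; those steps are left out here because the abstract does not specify how they were carried out.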
