Model‐free variable selection

Summary.  The importance of variable selection in regression has grown in recent years as computing power has encouraged the modelling of data sets of ever-increasing size. Data mining applications in finance, marketing and bioinformatics are obvious examples. A limitation of nearly all existing variable selection methods is the need to specify the correct model before selection. When the number of predictors is large, model formulation and validation can be difficult or even infeasible. On the basis of the theory of sufficient dimension reduction, we propose a new class of model-free variable selection approaches. The proposed methods assume no model of any form, require no nonparametric smoothing and allow for general predictor effects. Their efficacy is demonstrated via simulation, and an empirical example is given.
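To make the sufficient-dimension-reduction idea concrete, the sketch below uses sliced inverse regression (SIR; Li, 1991), a standard estimator of the central subspace that requires no model specification. This is an illustrative sketch, not the specific selection procedure proposed in the paper: predictors whose loadings on the estimated directions are near zero are candidates for removal. The function name `sir_directions` and the toy data are assumptions for illustration.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sliced inverse regression: estimate directions spanning the
    sufficient dimension reduction subspace without assuming a model."""
    n, p = X.shape
    # Standardize the predictors: Z = (X - mu) @ Sigma^{-1/2}
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(cov)                 # cov is symmetric PSD
    cov_isqrt = V @ np.diag(w ** -0.5) @ V.T   # symmetric inverse sqrt
    Z = (X - mu) @ cov_isqrt
    # Slice the response into groups of similar y and average Z per slice
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)   # weighted cov of slice means
    # Leading eigenvectors of M estimate the directions (on the Z scale)
    _, vecs = np.linalg.eigh(M)
    dirs_z = vecs[:, ::-1][:, :n_dirs]
    # Back-transform to the original predictor scale and normalize
    dirs = cov_isqrt @ dirs_z
    return dirs / np.linalg.norm(dirs, axis=0)

# Toy example: y depends only on x0 (nonlinearly); x1 and x2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = np.exp(X[:, 0]) + 0.1 * rng.normal(size=2000)
b = sir_directions(X, y)[:, 0]
# The loading on x0 dominates; near-zero loadings on x1 and x2 flag
# them as removable, with no regression model ever specified.
```

Note that SIR relies on a linearity condition on the predictor distribution (satisfied, e.g., by elliptically contoured predictors); model-free selection methods built on it inherit that condition rather than any assumption about the form of the regression function.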
