Variable selection methods in regression: Ignorable problem, outing notable solution

Variable selection in regression, the task of identifying the best subset among many candidate variables to include in a model, is arguably the hardest part of model building. Many variable selection methods exist. Many statisticians know them, but few know that they produce poorly performing models. Some variable selection methods are a miscarriage of statistics because they are developed by, in effect, debasing sound statistical theory. The purpose of this article is two-fold: (1) to re-examine the scope of the literature addressing the weaknesses of variable selection methods and (2) to re-enliven a possible solution for defining a better-performing regression model. To achieve this in practice, the article has two objectives: (1) to review five widely used variable selection methods, itemize some of their weaknesses and consider why they are used; and (2) to present Tukey's Exploratory Data Analysis (EDA) in the context of a natural seven-step cycle. Newcomers to Tukey's EDA should consider the seven-step cycle, which is introduced within the narrative of Tukey's analytic philosophy. John W. Tukey (16 June 1915 – 26 July 2000) was a significant contributor to the field of statistics, yet a humble, unpretentious man who always considered himself a data analyst. Tukey's seminal book, Exploratory Data Analysis, is known by its unique initialed title, EDA.
