Robust VIF regression with application to variable selection in large data sets

The sophisticated and automated means of data collection used by an increasing number of institutions and companies lead to extremely large datasets. Subset selection in regression is essential when a huge number of covariates can potentially explain a response variable of interest. The recent statistical literature has seen the emergence of new selection methods that offer a compromise between implementation (computational speed) and statistical optimality (e.g. prediction error minimization). Global methods such as Mallows’ Cp have been supplanted by sequential methods such as stepwise regression. More recently, streamwise regression, which is faster than stepwise regression, has emerged. A recently proposed streamwise regression approach based on the variance inflation factor (VIF) is promising, but its least-squares-based implementation makes it susceptible to the outliers that are inevitable in such large datasets. This lack of robustness can lead to poor, suboptimal feature selection. In our application, we seek to predict an individual’s educational attainment using economic and demographic variables. We show that classical VIF regression performs this task poorly and that a robust procedure is necessary for policy makers. This article proposes a robust VIF regression, based on fast robust estimators, that inherits all the good properties of classical VIF regression in the absence of outliers and continues to perform well in their presence, where the classical approach fails.
