Scalable holistic linear regression

Abstract: We propose a new scalable algorithm for holistic linear regression, building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints, rather than checking these conditions iteratively. The resulting algorithm scales to sample sizes n in the tens of thousands, compared with the low hundreds for the previous framework. Computational results on real and synthetic datasets show that it substantially improves on previous algorithms in accuracy, false detection rate, computational time, and scalability.
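To make the lazy-constraint idea concrete, here is a minimal sketch in Python. It is not the paper's formulation. It rests on several assumptions: brute-force enumeration stands in for a mixed-integer solver, multicollinearity is measured by the condition number of the correlation matrix with an arbitrary threshold of 30, and the significance constraint is left out. The names `rss` and `lazy_subset_selection` are hypothetical. Each violated candidate support produces a cut that excludes that support and all its supersets. This is valid here because the condition number of a correlation submatrix cannot decrease when variables are added.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns."""
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    return float(resid @ resid)

def lazy_subset_selection(X, y, k, max_cond=30.0):
    """Pick the lowest-RSS support of size <= k whose correlation submatrix
    has condition number <= max_cond (threshold is an assumption).

    Multicollinearity is not checked up front for every support. A cut is
    added only when the current best candidate violates the condition.
    """
    p = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    candidates = [c for size in range(1, k + 1)
                  for c in itertools.combinations(range(p), size)]
    # Visiting candidates in increasing RSS order mimics re-solving the
    # master problem after each new cut (brute-force stand-in for a solver).
    candidates.sort(key=lambda c: rss(X, y, list(c)))
    cuts = []  # each cut forbids a support and every superset of it
    for cols in candidates:
        if any(cut <= set(cols) for cut in cuts):
            continue  # already excluded by an earlier lazy cut
        sub = corr[np.ix_(cols, cols)]
        if np.linalg.cond(sub) > max_cond:
            cuts.append(set(cols))  # violated: add the cut lazily
            continue
        return cols, cuts
    return None, cuts

# Synthetic data: column 1 is a near-copy of column 0.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.01 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = x0 + x2 + 0.1 * rng.normal(size=n)

selected, cuts = lazy_subset_selection(X, y, k=3)
print("selected columns:", selected, "cuts added:", cuts)
```

On this toy data the full support has the lowest residual but contains the near-duplicate pair, so it triggers a cut and the search moves on to a two-variable support. In a real solver the same check would typically run inside a callback at each incumbent solution.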
