Sparse hierarchical regression with polynomials

We present a novel method for sparse polynomial regression. We are interested in the degree-$$r$$ polynomial that depends on at most $$k$$ inputs, comprises at most $$\ell$$ monomial terms, and minimizes the sum of the squares of its prediction errors. Such highly structured sparse regression was termed sparse hierarchical regression by Bach (Advances in Neural Information Processing Systems, pp 105–112, 2009) in the context of kernel learning. The hierarchical sparse specification aligns well with modern big-data settings in which many inputs are irrelevant for prediction and the functional complexity of the regressor needs to be controlled so as to avoid overfitting. We propose an efficient two-step approach to this hierarchical sparse regression problem. First, we discard irrelevant inputs using an extremely fast input ranking heuristic. Second, we take advantage of modern cutting-plane methods for integer optimization to solve the remaining reduced hierarchical $$(k, \ell)$$-sparse problem exactly. The ability of our method to identify all $$k$$ relevant inputs and all $$\ell$$ monomial terms is shown empirically to undergo a phase transition. Crucially, the same transition also governs our ability to reject all irrelevant features and monomials. In the regime where our method is statistically powerful, its computational complexity is, interestingly, on par with Lasso-based heuristics. Hierarchical sparsity can retain the flexibility of general nonparametric methods such as nearest neighbors or regression trees (CART) without sacrificing much statistical power. The presented work hence fills a void, namely the lack of powerful, disciplined nonlinear sparse regression methods in high-dimensional settings. Our method is shown empirically to scale to regression problems with $$n\approx 10{,}000$$ observations and input dimension $$p\approx 1000$$.
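
For concreteness, the estimation problem described above can be written as follows; the set notation $$\mathcal{P}_r(k,\ell)$$ is introduced here for exposition only, and the paper's precise formulation may differ in details (e.g., an additional regularization term). Given data $$(x_1, y_1), \dots, (x_n, y_n)$$ with $$x_i \in \mathbb{R}^p$$ and $$y_i \in \mathbb{R}$$, we seek

$$\min_{g \in \mathcal{P}_r(k,\ell)} \; \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2,$$

where $$\mathcal{P}_r(k,\ell)$$ collects all polynomials of degree at most $$r$$ that depend on at most $$k$$ of the $$p$$ inputs and contain at most $$\ell$$ monomial terms. The two-step method first prunes the candidate inputs with the fast ranking heuristic and then selects the monomials over the retained inputs by solving this reduced hierarchical $$(k, \ell)$$-sparse problem exactly with cutting-plane integer optimization.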

[1]  P. Zhao,et al.  The composite absolute penalties family for grouped and hierarchical variable selection , 2009, 0909.0411.

[2]  Johan A. K. Suykens,et al.  Least Squares Support Vector Machine Classifiers , 1999, Neural Processing Letters.

[3]  J. Mercer Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations , 1909 .

[4]  Francis R. Bach,et al.  Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning , 2008, NIPS.

[5]  N. Altman An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression , 1992 .

[6]  A. Tikhonov On the stability of inverse problems , 1943 .

[7]  H. Kile,et al.  Bandwidth Selection in Kernel Density Estimation , 2010 .

[8]  Ignacio E. Grossmann,et al.  An outer-approximation algorithm for a class of mixed-integer nonlinear programs , 1987, Math. Program..

[9]  Emmanuel J. Candès,et al.  Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information , 2004, IEEE Transactions on Information Theory.

[10]  David Gamarnik,et al.  High Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transition , 2017, COLT.

[11]  Fabian J. Theis,et al.  Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: CRC Press , 2018, Biometrics.

[12]  Jianqing Fan,et al.  Sure independence screening for ultrahigh dimensional feature space , 2006, math/0612857.

[13]  Trevor Hastie,et al.  Statistical Learning with Sparsity: The Lasso and Generalizations , 2015 .

[14]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[15]  V. Vapnik The Support Vector Method of Function Estimation , 1998 .

[16]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[17]  Samory Kpotufe,et al.  k-NN Regression Adapts to Local Intrinsic Dimension , 2011, NIPS.

[18]  Victoria Stodden,et al.  Breakdown Point of Model Selection When the Number of Variables Exceeds the Number of Observations , 2006, The 2006 IEEE International Joint Conference on Neural Network Proceedings.

[19]  T. Poggio,et al.  On optimal nonlinear associative recall , 1975, Biological Cybernetics.

[20]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[21]  Petros Drineas,et al.  On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning , 2005, J. Mach. Learn. Res..

[22]  Francis R. Bach,et al.  Consistency of the group Lasso and multiple kernel learning , 2007, J. Mach. Learn. Res..

[23]  Dimitris Bertsimas,et al.  Characterization of the equivalence of robustification and regularization in linear and matrix regression , 2017, Eur. J. Oper. Res..

[24]  Johan A. K. Suykens,et al.  LS-SVMlab: a MATLAB/C toolbox for Least Squares Support Vector Machines , 2007 .

[25]  Thomas F. Brooks,et al.  Airfoil self-noise and prediction , 1989 .

[26]  Martin W. P. Savelsbergh,et al.  Branch-and-Price: Column Generation for Solving Huge Integer Programs , 1998, Oper. Res..

[27]  Jianqing Fan,et al.  A Selective Overview of Variable Selection in High Dimensional Feature Space. , 2009, Statistica Sinica.

[28]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[29]  Ling Huang,et al.  Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression , 2010, NIPS.

[30]  K. Smith  On the Standard Deviations of Adjusted and Interpolated Values of an Observed Polynomial Function and its Constants and the Guidance They Give Towards a Proper Choice of the Distribution of Observations , 1918 .

[31]  Alexander J. Smola,et al.  Learning with Kernels: support vector machines, regularization, optimization, and beyond , 2001, Adaptive computation and machine learning series.

[32]  M. Stone The Generalized Weierstrass Approximation Theorem , 1948 .

[33]  Iain Dunning,et al.  Computing in Operations Research Using Julia , 2013, INFORMS J. Comput..

[34]  Ning Hao,et al.  Interaction Screening for Ultrahigh-Dimensional Data , 2014, Journal of the American Statistical Association.

[35]  Sara van de Geer,et al.  Statistics for High-Dimensional Data , 2011 .

[36]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[37]  Yurii Nesterov,et al.  Interior-point polynomial algorithms in convex programming , 1994, SIAM Studies in Applied Mathematics.

[38]  E. L. Lawler,et al.  Branch-and-Bound Methods: A Survey , 1966, Oper. Res..

[39]  Bart P. G. Van Parys,et al.  Sparse high-dimensional regression: Exact scalable algorithms and phase transitions , 2017, The Annals of Statistics.

[40]  Yingying Fan,et al.  Interaction pursuit in high-dimensional multi-response regression via distance correlation , 2016, 1605.03315.

[41]  Sven Leyffer,et al.  Solving mixed integer nonlinear programs by outer approximation , 1994, Math. Program..

[42]  W. Y. Zhang,et al.  Discussion on 'Sure independence screening for ultra-high dimensional feature space' by Fan, J. and Lv, J. , 2008 .

[43]  D. Bertsimas,et al.  Best Subset Selection via a Modern Optimization Lens , 2015, 1507.03133.

[44]  Martin J. Wainwright,et al.  Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $\ell _{1}$ -Constrained Quadratic Programming (Lasso) , 2009, IEEE Transactions on Information Theory.

[45]  Athanasios Tsanas,et al.  Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools , 2012 .

[46]  Peter Hall,et al.  On selecting interacting features from high-dimensional data , 2014, Comput. Stat. Data Anal..

[47]  A. Atkinson Subset Selection in Regression , 1992 .

[48]  Arthur E. Hoerl,et al.  Ridge Regression: Biased Estimation for Nonorthogonal Problems , 2000, Technometrics.

[49]  I-Cheng Yeh,et al.  Modeling of strength of high-performance concrete using artificial neural networks , 1998 .

[50]  Paulo Cortez,et al.  Modeling wine preferences by data mining from physicochemical properties , 2009, Decis. Support Syst..

[51]  Fang Zhou,et al.  Predicting the Geographical Origin of Music , 2014, 2014 IEEE International Conference on Data Mining.

[52]  Hidetoshi Matsui  Statistics for High-Dimensional Data: Methods, Theory and Applications , 2014 .