On the Self-Penalization Phenomenon in Feature Selection

We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: $\min_{\beta, f} \hat{\mathbb{E}}[L(Y, f(\beta \odot X))] + \lambda_n \|f\|_{\mathcal{H}_q}^2$ subject to $\beta \ge 0$, where $L$ is the loss, $\odot$ is coordinate-wise multiplication, and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') = h(\|x - x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient descent to optimize this objective with respect to $\beta$ leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., $\ell_1$), early stopping, or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
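To make the objective concrete, the following is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes squared loss, $q = 1$, and $h(t) = e^{-t}$ (a Laplace-type kernel), so that the inner minimization over $f$ is kernel ridge regression with a closed form and only $\beta$ is optimized iteratively. The function names `objective_and_grad` and `projected_gradient_descent`, the $1/n$ normalization of the empirical loss, the step size, and the synthetic data are all illustrative choices. The `np.maximum(..., 0)` step is the projection that enforces the constraint $\beta \ge 0$.

```python
import numpy as np

def objective_and_grad(beta, X, y, lam):
    """Value and gradient of J(beta) = min_f (1/n) sum_i (y_i - f(beta * x_i))^2 + lam * ||f||^2.

    Assumes the kernel k(x, x') = exp(-sum_l beta_l * |x_l - x'_l|), i.e. q = 1, h(t) = exp(-t).
    """
    n, _ = X.shape
    # D[l, i, j] = |X[i, l] - X[j, l]|: per-coordinate absolute differences
    D = np.abs(X[:, None, :] - X[None, :, :]).transpose(2, 0, 1)
    # Kernel matrix of the rescaled inputs, shape (n, n)
    K = np.exp(-np.tensordot(beta, D, axes=1))
    # Kernel ridge solution: alpha = (K + n*lam*I)^{-1} y
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    # Plugging alpha back in gives J(beta) = lam * y^T alpha
    value = lam * (y @ alpha)
    # dJ/dbeta_l = -lam * alpha^T (dK/dbeta_l) alpha, with dK/dbeta_l = -D_l * K
    grad = lam * np.einsum("i,lij,j->l", alpha, D * K, alpha)
    return value, grad

def projected_gradient_descent(X, y, lam=0.1, step=0.5, n_iter=200, beta0=None):
    """Gradient descent on beta, projected onto the constraint set beta >= 0."""
    d = X.shape[1]
    beta = np.full(d, 1.0 / d) if beta0 is None else np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        _, grad = objective_and_grad(beta, X, y, lam)
        beta = np.maximum(beta - step * grad, 0.0)  # projection onto beta >= 0
    return beta

if __name__ == "__main__":
    # Synthetic example (hypothetical data): only the first two features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=60)
    print(projected_gradient_descent(X, y))
```

Whether any coordinates of the returned `beta` are exactly zero depends on the data, `lam`, the step size, and the number of iterations; the sketch only illustrates the form of the objective and of the update.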
