Hyper-Sparse Optimal Aggregation

Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to achieve a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures have been convex combinations of all the elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required to construct an optimal aggregation procedure in every situation. We then perform a numerical study for the problem of selecting the regularization parameters of the Lasso and Elastic-net estimators, comparing our aggregation algorithms on simulated examples with aggregation with exponential weights, Mallows' Cp, and cross-validation selection procedures.
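
To make the two-function idea concrete, here is a minimal sketch, assuming squared loss, of an aggregation step whose output is a convex combination of at most two candidates: first select the empirical risk minimizer over F on a validation sample, then minimize the empirical risk over the segments joining it to every other candidate. The function name, data layout, and this particular selection rule are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def two_function_aggregate(preds, y):
    """Sketch of a two-function aggregation step under squared loss.

    preds : (m, n) array of predictions of the m candidates in F on a
            validation sample of size n (hypothetical layout).
    y     : (n,) array of validation responses.

    Returns a prediction that is a convex combination of at most two
    rows of `preds`.
    """
    # Step 1: empirical risk minimizer over F.
    risks = np.mean((preds - y) ** 2, axis=1)
    f_hat = preds[np.argmin(risks)]
    best_pred, best_risk = f_hat, risks.min()

    # Step 2: minimize the empirical risk over each segment
    # [f_hat, g], g in F.  Under squared loss the optimal weight
    # on a segment is explicit, then clipped to [0, 1].
    for g in preds:
        d = g - f_hat
        denom = d @ d
        if denom == 0.0:  # g coincides with f_hat on the sample
            continue
        lam = np.clip((y - f_hat) @ d / denom, 0.0, 1.0)
        cand = (1.0 - lam) * f_hat + lam * g
        risk = np.mean((cand - y) ** 2)
        if risk < best_risk:
            best_pred, best_risk = cand, risk
    return best_pred
```

Because the output involves only the selected pair, the aggregate remains hyper-sparse even when F contains many irrelevant candidates, in contrast with exponential-weight aggregates, which place positive weight on every element of F.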
