Sparse Regression Learning by Aggregation and Langevin Monte-Carlo

We consider the problem of regression learning for deterministic design and independent random errors. We start by proving a sharp PAC-Bayesian type bound for the exponentially weighted aggregate (EWA) under the expected squared empirical loss. For a broad class of noise distributions the presented bound is valid whenever the temperature parameter β of the EWA is larger than or equal to 4σ², where σ² is the noise variance. A remarkable feature of this result is that it is valid even for unbounded regression functions, and the choice of the temperature parameter depends exclusively on the noise level. Next, we apply this general bound to the problem of aggregating the elements of a finite-dimensional linear space spanned by a dictionary of functions φ₁, ..., φ_M. We allow M to be much larger than the sample size n, but we assume that the true regression function can be well approximated by a sparse linear combination of the functions φ_j. Under this sparsity scenario, we propose an EWA with a heavy-tailed prior and show that it satisfies a sparsity oracle inequality with leading constant one. Finally, we propose several Langevin Monte-Carlo algorithms to approximately compute such an EWA when the number M of aggregated functions can be large. We discuss in some detail the convergence of these algorithms and present numerical experiments that confirm our theoretical findings.
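To make the computational idea concrete, the following is a minimal sketch of how a Langevin Monte-Carlo scheme can approximate an EWA: the aggregate is the mean of a pseudo-posterior proportional to exp(-(empirical loss)/β) times a prior, and an unadjusted Langevin algorithm produces iterates whose running average estimates that mean. All names here (`Phi`, `Y`, `beta`, the step size `h`, and the flat-prior simplification) are illustrative assumptions, not the paper's actual algorithm or heavy-tailed prior.

```python
import numpy as np

def ula_ewa(Phi, Y, beta, h=0.01, n_iter=20_000, seed=0):
    """Sketch: approximate an EWA-type estimate by averaging iterates of the
    unadjusted Langevin algorithm (ULA) targeting the pseudo-posterior
        p(theta) ∝ exp(-||Y - Phi @ theta||^2 / beta),
    i.e. a flat prior for simplicity; a heavy-tailed prior, as in the paper,
    would contribute an extra log-prior gradient term to the drift.
    """
    rng = np.random.default_rng(seed)
    n, M = Phi.shape
    theta = np.zeros(M)
    running_sum = np.zeros(M)
    for _ in range(n_iter):
        # Gradient of the negative log target: (2/beta) * Phi^T (Phi theta - Y).
        grad = (2.0 / beta) * Phi.T @ (Phi @ theta - Y)
        # One Euler step of the Langevin diffusion with step size h.
        theta = theta - h * grad + np.sqrt(2.0 * h) * rng.standard_normal(M)
        running_sum += theta
    return running_sum / n_iter  # ergodic average ≈ posterior mean
```

With a flat prior and orthogonal design the target is Gaussian centered at the least-squares fit, so the ergodic average should land near that fit; a Metropolis-adjusted variant (MALA) would correct the discretization bias that ULA incurs for finite `h`.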
