Kullback-Leibler aggregation and misspecified generalized linear models

In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it requires neither identifiability of the parameter nor that the model on the systematic component be true. We show that this problem can be solved by constrained and/or penalized likelihood maximization, and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally, all the bounds are proved to be optimal in a minimax sense.
