An improper estimator with optimal excess risk in misspecified density estimation and logistic regression

We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for supervised statistical learning. On standard examples, this bound scales as $d/n$, where $d$ is the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches that reduce to the sequential problem, our bounds remove suboptimal $\log n$ factors, addressing an open problem of Grünwald and Kotlowski [70] for the considered models, and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by the leverage scores of the covariates, and nearly match the optimal risk in the well-specified case, without conditions on the noise variance or on the approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to the calibration of probabilistic predictions, relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2 R^2)/n)$, where $R$ bounds the norm of the features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n},\, d e^{BR}/n)$ in general. This provides a computationally more efficient alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly answering a question raised by Foster et al. (2018) [16].
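To make the logistic-regression procedure concrete, the following is a minimal sketch (not from the paper) of one natural reading of the construction above: for each candidate label $y \in \{0, 1\}$ of a query point $x$, refit the maximum likelihood estimator on the sample augmented with the virtual example $(x, y)$, then normalize the two resulting likelihoods at $(x, y)$ into a predictive distribution. It assumes scikit-learn's `LogisticRegression` as the inner, near-unregularized MLE solver; the function name `smp_predict_proba` is illustrative.

```python
# Hedged sketch of SMP for logistic regression, following the abstract:
# one logistic regression fit per candidate label of the query point.
import numpy as np
from sklearn.linear_model import LogisticRegression

def smp_predict_proba(X, y, x_new):
    """Return SMP's predictive probability that the label of x_new is 1.

    X     : (n, d) array of covariates
    y     : (n,) array of binary labels in {0, 1} (both classes assumed present)
    x_new : (d,) query covariate vector
    """
    likelihoods = {}
    for virtual_label in (0, 1):
        # Augment the sample with the virtual example (x_new, virtual_label).
        X_aug = np.vstack([X, x_new])
        y_aug = np.append(y, virtual_label)
        # Near-unregularized MLE fit (large C makes the l2 penalty negligible).
        clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_aug, y_aug)
        # Likelihood the augmented MLE assigns to its own virtual label at x_new.
        likelihoods[virtual_label] = clf.predict_proba(
            x_new.reshape(1, -1))[0, virtual_label]
    # Normalize the two likelihoods into a predictive distribution over {0, 1}.
    return likelihoods[1] / (likelihoods[0] + likelihoods[1])
```

Note that the resulting map $x \mapsto \widehat{p}(1 \mid x)$ is in general not the sigmoid of any fixed linear function of $x$, which is precisely what makes the procedure improper (out-of-model).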

[1] E. L. Lehmann, et al. Theory of point estimation, 1950.

[2] Andrew R. Barron, et al. Asymptotic minimax regret for data compression, gambling, and prediction, 1997, IEEE Trans. Inf. Theory.

[3] Eric Moulines, et al. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013, NIPS.

[4] Anja Vogler, et al. An Introduction to Multivariate Statistical Analysis, 2004.

[5] G. D. Murray, et al. Note on estimation of probability density functions, 1977.

[6] Alexander Shapiro, et al. Stochastic Approximation approach to Stochastic Programming, 2013.

[7] P. Massart, et al. Concentration inequalities and model selection, 2007.

[8] Eric R. Ziegel, et al. Generalized Linear Models, 2002, Technometrics.

[9] Dmitrii Ostrovskii, et al. Finite-sample Analysis of M-estimators using Self-concordance, 2018, 1810.06838.

[10] Shai Shalev-Shwartz, et al. Online Learning and Online Convex Optimization, 2012, Found. Trends Mach. Learn.

[11] H. Robbins. A Stochastic Approximation Method, 1951.

[12] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[13] Ambuj Tewari, et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization, 2008, NIPS.

[14] Gábor Lugosi, et al. Prediction, learning, and games, 2006.

[15] Stephen P. Boyd, et al. Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.

[16] Haipeng Luo, et al. Logistic Regression: The Importance of Being Improper, 2018, COLT.

[17] Jon A. Wellner, et al. Weak Convergence and Empirical Processes: With Applications to Statistics, 1996.

[18] Roman Vershynin, et al. Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[19] Peter Grünwald, et al. A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity, 2017, ALT.

[20] A. Caponnetto, et al. Optimal Rates for the Regularized Least-Squares Algorithm, 2007, Found. Comput. Math.

[21] Feng Liang, et al. Improved minimax predictive densities under Kullback-Leibler loss, 2006.

[22] Nicolas Macris, et al. Optimal errors and phase transitions in high-dimensional generalized linear models, 2017, Proceedings of the National Academy of Sciences.

[23] Elad Hazan, et al. Logistic Regression: Tight Bounds for Stochastic and Online Optimization, 2014, COLT.

[24] Pierre Gaillard, et al. A Chaining Algorithm for Online Nonparametric Regression, 2015, COLT.

[25] Elad Hazan, et al. Introduction to Online Convex Optimization, 2016, Found. Trends Optim.

[26] Francis R. Bach, et al. Self-concordant analysis for logistic regression, 2009, ArXiv.

[27] Larry Wasserman, et al. All of Nonparametric Statistics (Springer Texts in Statistics), 2006.

[28] Wojciech Kotlowski, et al. Maximum Likelihood vs. Sequential Normalized Maximum Likelihood in On-line Density Estimation, 2011, COLT.

[29] J. Rissanen, et al. On sequentially normalized maximum likelihood models, 2008.

[30] Vee Ming Ng, et al. On the estimation of parametric density functions, 1980.

[31] P. Massart, et al. Minimum contrast estimators on sieves: exponential bounds and rates of convergence, 1998.

[32] Ambuj Tewari, et al. Smoothness, Low Noise and Fast Rates, 2010, NIPS.

[33] Stergios B. Fotopoulos, et al. All of Nonparametric Statistics, 2007, Technometrics.

[34] Martin Zinkevich, et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.

[35] D. Freedman, et al. How Many Variables Should Be Entered in a Regression Equation, 1983.

[36] J. Picard, et al. Statistical learning theory and stochastic optimization: École d'été de probabilités de Saint-Flour XXXI - 2001, 2004.

[37] Gábor Lugosi, et al. Concentration Inequalities - A Nonasymptotic Theory of Independence, 2013, Concentration Inequalities.

[38] Malay Ghosh, et al. Nonsubjective priors via predictive relative entropy regret, 2006.

[39] V. Spokoiny. Parametric estimation. Finite sample theory, 2011, 1111.3029.

[40] Sébastien Bubeck, et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[41] Peter L. Bartlett, et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.

[42] J. Berkson. Application of the Logistic Function to Bio-Assay, 1944.

[43] R. Bhatia. Positive Definite Matrices, 2007.

[44] Manfred K. Warmuth, et al. The Weighted Majority Algorithm, 1994, Inf. Comput.

[45] Ron Meir, et al. Generalization Error Bounds for Bayesian Mixture Algorithms, 2003, J. Mach. Learn. Res.

[46] Kfir Y. Levy, et al. Fast Rates for Exp-concave Empirical Risk Minimization, 2015, NIPS.

[47] A. Barron, et al. Jeffreys' prior is asymptotically least favorable under entropy risk, 1994.

[48] T. Poggio, et al. Stability results in learning theory, 2005.

[49] Yurii Nesterov, et al. Interior-point polynomial algorithms in convex programming, 1994, SIAM Studies in Applied Mathematics.

[50] Matthew J. Streeter, et al. Open Problem: Better Bounds for Online Logistic Regression, 2012, COLT.

[51] F. Komaki. On asymptotic properties of predictive distributions, 1996.

[52] H. White. Maximum Likelihood Estimation of Misspecified Models, 1982.

[53] André Elisseeff, et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[54] Alessandro Rudi, et al. Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses, 2019, NeurIPS.

[55] Sham M. Kakade, et al. Online Bounds for Bayesian Algorithms, 2004, NIPS.

[56] R. Z. Khasʹminskiĭ, et al. Statistical estimation: asymptotic theory, 1981.

[57] J. Hájek. Local asymptotic minimax and admissibility in estimation, 1972.

[58] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[59] Feng Liang, et al. Exact minimax strategies for predictive density estimation, data compression, and model selection, 2002, IEEE Transactions on Information Theory.

[60] Y. Baraud, et al. Rho-estimators revisited: General theory and applications, 2016, The Annals of Statistics.

[61] Ian R. Harris. Predictive fit for natural exponential families, 1989.

[62] L. Le Cam, et al. Asymptotic Methods in Statistical Decision Theory, 1986.

[63] S. R. Jammalamadaka, et al. Empirical Processes in M-Estimation, 2001.

[64] A. Barron. Are Bayes Rules Consistent in Information?, 1987.

[65] Eric R. Ziegel, et al. The Elements of Statistical Learning, 2003, Technometrics.

[66] G. Wahba. Spline models for observational data, 1990.

[67] Francis R. Bach, et al. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression, 2013, J. Mach. Learn. Res.

[68] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation, 2007, math/0703854.

[69] Yuhong Yang. Mixing Strategies for Density Estimation, 2000.

[70] Wojciech Kotlowski, et al. Bounds on Individual Risk for Log-loss Predictors, 2011, COLT.

[71] Jaouad Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices, 2019.

[72] Jean-Yves Audibert, et al. Progressive mixture rules are deviation suboptimal, 2007, NIPS.

[73] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, 2014.

[74] Jorma Rissanen. Minimum Description Length Principle, 2010, Encyclopedia of Machine Learning.

[75] E. Candès, et al. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression, 2018, The Annals of Statistics.

[76] Neri Merhav, et al. Universal Prediction, 1998, IEEE Trans. Inf. Theory.

[77] A. Juditsky, et al. Learning by mirror averaging, 2005, math/0511468.

[78] Alessandro Rudi, et al. Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance, 2019, COLT.

[79] T. N. Sriram. Asymptotics in Statistics: Some Basic Concepts, 2002.

[80] Shai Shalev-Shwartz, et al. Average Stability is Invariant to Data Preconditioning. Implications to Exp-concave Empirical Risk Minimization, 2016, J. Mach. Learn. Res.

[81] Jorma Rissanen. Fisher information and stochastic complexity, 1996, IEEE Trans. Inf. Theory.

[82] Yuhong Yang, et al. An Asymptotic Property of Model Selection Criteria, 1998, IEEE Trans. Inf. Theory.

[83] Nishant Mehta. Fast rates with high probability in exp-concave statistical learning, 2016, AISTATS.

[84] Trevor Hastie, et al. The Elements of Statistical Learning, 2001.

[85] Rong Jin, et al. Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization, 2015, COLT.

[86] Yuhong Yang, et al. Information-theoretic determination of minimax rates of convergence, 1999.

[87] Claudio Gentile, et al. On the generalization ability of on-line learning algorithms, 2001, IEEE Transactions on Information Theory.

[88] Shahar Mendelson. Learning without Concentration, 2014, COLT.

[89] Ohad Shamir, et al. Learnability, Stability and Uniform Convergence, 2010, J. Mach. Learn. Res.

[90] Mihaela Aslan, et al. Asymptotically minimax Bayes predictive densities, 2006, 0708.0177.

[91] Edward I. George, et al. Admissible predictive density estimation, 2008.

[92] Jayanta K. Ghosh, et al. Higher Order Asymptotics, 1994.

[93] W. Wong, et al. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs, 1995.

[94] Nick Littlestone. From on-line to batch learning, 1989, COLT '89.

[95] Soumendu Sundar Mukherjee, et al. Weak convergence and empirical processes, 2019.

[96] Manfred K. Warmuth, et al. The Last-Step Minimax Algorithm, 2000, ALT.

[97] V. Koltchinskii, et al. Bounding the smallest singular value of a random matrix without concentration, 2013, 1312.3580.

[98] J. Aitchison. Goodness of prediction fit, 1975.

[99] Alessandro Rudi, et al. Efficient improper learning for online logistic regression, 2020, COLT.

[100] Nicolò Cesa-Bianchi, et al. Worst-Case Bounds for the Logarithmic Loss of Predictors, 1999, Machine Learning.

[101] L. Birgé, et al. A new method for estimation and model selection: $\rho$-estimation, 2014, 1403.6057.

[102] T. M. Cover, et al. Open Problems in Communication and Computation, 1987.

[103] Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares, 2013, ArXiv.

[104] J. Hartigan. The maximum likelihood prior, 1998.

[105] P. Massart, et al. Rates of convergence for minimum contrast estimators, 1993.

[106] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[107] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation, 2006, math/0702653.

[108] Peter Grünwald, et al. Fast Rates with Unbounded Losses, 2016, ArXiv.

[109] L. Birgé, et al. Rho-estimators for shape restricted density estimation, 2016.

[110] Nathan Srebro, et al. Fast Rates for Regularized Objectives, 2008, NIPS.

[111] S. Mendelson, et al. Performance of empirical risk minimization in linear aggregation, 2014, 1402.5763.

[112] Peter L. Bartlett, et al. Horizon-Independent Optimal Prediction with Log-Loss in Exponential Families, 2013, COLT.

[113] V. Koltchinskii, et al. Oracle inequalities in empirical risk minimization and sparse recovery problems, 2011.

[114] Peter Grünwald, et al. Fast Rates for General Unbounded Loss Functions: From ERM to Generalized Bayes, 2016, J. Mach. Learn. Res.

[115] O. Catoni. The Mixture Approach to Universal Model Selection, 1997.

[116] Luela Prifti, et al. On parametric estimation, 2015.

[117] Vladimir Vovk. A game of prediction with expert advice, 1995, COLT '95.

[118] Vladimir Vapnik. Statistical learning theory, 1998.

[119] Babak Hassibi, et al. The Impact of Regularization on High-dimensional Logistic Regression, 2019, NeurIPS.

[120] E. Candès, et al. A modern maximum-likelihood theory for high-dimensional logistic regression, 2018, Proceedings of the National Academy of Sciences.

[121] Abraham Wald. Statistical Decision Functions, 1951.