High-Probability Risk Bounds via Sequential Predictors

Online learning methods yield sequential regret bounds under minimal assumptions and provide in-expectation risk bounds for statistical learning. However, despite the apparent advantage of online guarantees over their statistical counterparts, recent findings indicate that in many important cases, regret bounds may not guarantee tight high-probability risk bounds in the statistical setting. In this work, we show that online-to-batch conversions applied to general online learning algorithms can bypass this limitation. Via a general second-order correction to the loss function defining the regret, we obtain nearly optimal high-probability risk bounds for several classical statistical estimation problems, such as discrete distribution estimation, linear regression, logistic regression, and conditional density estimation. Our analysis relies on the fact that many online learning algorithms are improper, as they are not restricted to using predictors from a given reference class. The improper nature of our estimators enables significant improvements in the dependence on various problem parameters. Finally, we discuss some computational advantages of our sequential algorithms over their existing batch counterparts.
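
For background, the classical in-expectation online-to-batch conversion that the abstract builds on can be stated as follows; the notation (loss $\ell$, reference class $\mathcal{F}$, online predictors $f_1,\dots,f_T$, averaged predictor $\bar f$) is ours and is not taken from the paper. Given i.i.d. observations $z_1,\dots,z_T$ and a loss that is convex in the predictor,
\[
\mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t),
\qquad
\bar f \;=\; \frac{1}{T} \sum_{t=1}^{T} f_t,
\]
\[
\mathbb{E}\bigl[R(\bar f)\bigr] \;-\; \min_{f \in \mathcal{F}} R(f) \;\le\; \frac{\mathbb{E}\bigl[\mathrm{Reg}_T\bigr]}{T},
\qquad \text{where } R(f) \;=\; \mathbb{E}_{z}\bigl[\ell(f, z)\bigr].
\]
The standard high-probability version of this conversion pays an extra deviation term of order $\sqrt{\log(1/\delta)/T}$, which can dominate fast (e.g., $\log T / T$) regret rates; this is the limitation the abstract refers to. The second-order correction modifies the per-round loss defining the regret, roughly by adding a squared loss-difference penalty so that the regret bound also controls the variance of the excess-loss martingale; the precise form of the correction is not given in the abstract, so this description should be read only as a sketch.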
