Logistic Regression: The Importance of Being Improper

Learning linear predictors with the logistic loss---in both the stochastic and online settings---is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing "fast rates" for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We also show that the improved dependence on the predictor norm is near-optimal. Leveraging this improved dependence on the predictor norm yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $\tilde{O}(\sqrt{n})$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings.
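
To make the "$1$-mixable" observation concrete, the following standard background (in the spirit of Vovk's aggregating algorithm, reference [4]) may help; it is an illustration of the definition only, not a description of this paper's algorithm, and the notation $\pi$, $p^*$, $\sigma$ below is introduced purely for exposition. A loss $\ell$ is $\eta$-mixable if, for every distribution $\pi$ over predictions, there exists a single prediction $p^*$ such that, for every outcome $y$,
\[
\ell(p^*, y) \;\le\; -\tfrac{1}{\eta} \log \int e^{-\eta\,\ell(p, y)}\, d\pi(p).
\]
Writing the logistic prediction as a probability $p = \sigma(\langle w, x\rangle)$ with $\sigma(z) = 1/(1+e^{-z})$ and $y \in \{0,1\}$, the logistic loss becomes the log loss $\ell(p, y) = -y \log p - (1-y)\log(1-p)$, for which the mixture prediction $p^* = \mathbb{E}_{\pi}[p]$ attains the inequality with $\eta = 1$ (indeed with equality for both outcomes). The crucial point is that $p^*$ need not equal $\sigma(\langle w, x\rangle)$ for any single $w$, which is precisely the freedom that improper learning exploits to circumvent the proper-learning lower bound.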

[1] J. Berkson. Application of the Logistic Function to Bio-Assay, 1944.

[2] Manfred K. Warmuth, et al. On Weak Learning, 1995, J. Comput. Syst. Sci.

[3] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[4] Vladimir Vovk, et al. A game of prediction with expert advice, 1995, COLT '95.

[5] Adam L. Berger, et al. A Maximum Entropy Approach to Natural Language Processing, 1996, CL.

[6] Neri Merhav, et al. Universal Prediction, 1998, IEEE Trans. Inf. Theory.

[7] Gábor Lugosi, et al. Minimax regret under log loss for general classes of experts, 1999, COLT '99.

[8] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[9] Martin Zinkevich, et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.

[10] Sham M. Kakade, et al. Online Bounds for Bayesian Algorithms, 2004, NIPS.

[11] Yoram Singer, et al. Convex Repeated Games and Fenchel Duality, 2006, NIPS.

[12] Gábor Lugosi, et al. Prediction, learning, and games, 2006.

[13] Santosh S. Vempala, et al. Fast Algorithms for Logconcave Functions: Sampling, Rounding, Integration and Optimization, 2006, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[14] Santosh S. Vempala, et al. The geometry of logconcave functions and sampling algorithms, 2007, Random Struct. Algorithms.

[15] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[16] Ambuj Tewari, et al. Efficient bandit algorithms for online multiclass prediction, 2008, ICML '08.

[17] Ambuj Tewari, et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization, 2008, NIPS.

[18] Francis R. Bach, et al. Self-concordant analysis for logistic regression, 2009, ArXiv.

[19] Alexander Shapiro, et al. Stochastic Approximation approach to Stochastic Programming, 2013.

[20] Ambuj Tewari, et al. Online Learning: Random Averages, Combinatorial Parameters, and Learnability, 2010, NIPS.

[21] John Langford, et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2010, AISTATS.

[22] Elad Hazan, et al. Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction, 2011, NIPS.

[23] Matthew J. Streeter, et al. Open Problem: Better Bounds for Online Logistic Regression, 2012, COLT.

[24] Eric Moulines, et al. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013, NIPS.

[25] Elad Hazan, et al. Logistic Regression: Tight Bounds for Stochastic and Online Optimization, 2014, COLT.

[26] Karthik Sridharan, et al. Online Nonparametric Regression, 2014, ArXiv.

[27] Francis R. Bach, et al. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression, 2013, J. Mach. Learn. Res.

[28] Ambuj Tewari, et al. Sequential complexities and uniform martingale laws of large numbers, 2015.

[29] Ambuj Tewari, et al. Online learning via sequential complexities, 2010, J. Mach. Learn. Res.

[30] Haipeng Luo, et al. Optimal and Adaptive Algorithms for Online Boosting, 2015, ICML.

[31] Mark D. Reid, et al. Fast rates in statistical and online learning, 2015, J. Mach. Learn. Res.

[32] Karthik Sridharan, et al. Online Nonparametric Regression with General Loss Functions, 2015, ArXiv.

[33] Karthik Sridharan, et al. Sequential Probability Assignment with Binary Alphabets and Large Classes of Experts, 2015, ArXiv.

[34] Ambuj Tewari, et al. Online multiclass boosting, 2017, NIPS.

[35] Nishant Mehta, et al. Fast rates with high probability in exp-concave statistical learning, 2016, AISTATS.

[36] Hariharan Narayanan, et al. Efficient Sampling from Time-Varying Log-Concave Distributions, 2013, J. Mach. Learn. Res.

[37] Ambuj Tewari, et al. Online Boosting Algorithms for Multi-label Ranking, 2017, AISTATS.

[38] Sébastien Bubeck, et al. Sampling from a Log-Concave Distribution with Projected Langevin Monte Carlo, 2015, Discrete & Computational Geometry.