Online learning via sequential complexities

We consider the problem of sequential prediction and provide tools to study the minimax value of the associated game. Classical statistical learning theory provides several useful complexity measures to study learning with i.i.d. data. Our proposed sequential complexities can be seen as extensions of these measures to the sequential setting. The developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, we show necessary and sufficient conditions for online learnability in the setting of supervised learning. Several examples show the utility of our framework: we can establish learnability without having to exhibit an explicit online learning algorithm.
