ZigZag: A New Approach to Adaptive Online Learning

We develop a novel family of algorithms for the online learning setting whose regret against any data sequence is bounded by the empirical Rademacher complexity of that sequence. To develop a general theory of when this type of adaptive regret bound is achievable, we establish a connection to the theory of decoupling inequalities for martingales in Banach spaces. When the hypothesis class is a set of linear functions bounded in some norm, such a regret bound is achievable if and only if the norm satisfies certain decoupling inequalities for martingales. Donald Burkholder's celebrated geometric characterization of decoupling inequalities (1984) states that such an inequality holds if and only if there exists a special function, called a Burkholder function, satisfying certain restricted concavity properties. Our online learning algorithms are efficient in terms of queries to this function. We realize our general theory by giving novel efficient algorithms for classes including $\ell_p$ norms, Schatten p-norms, group norms, and reproducing kernel Hilbert spaces. When used in the i.i.d. setting, the empirical Rademacher complexity regret bound implies a data-dependent complexity bound for excess risk after online-to-batch conversion. To showcase the power of the empirical Rademacher complexity regret bound, we derive improved rates for a supervised learning generalization of the online learning with low rank experts task and for the online matrix prediction task. In addition to obtaining tight data-dependent regret bounds, our algorithms enjoy improved efficiency over previous techniques based on Rademacher complexity, automatically work in the infinite horizon setting, and are scale-free. To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization.
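As a rough illustration of the quantity that appears in the regret bound, the sketch below estimates the empirical Rademacher complexity of a data sequence for one simple case: linear functions x -> <w, x> with the weight vector w in the Euclidean unit ball. It is not the ZigZag algorithm and is not taken from the paper. The function names, the unnormalized definition (no division by the sequence length), and the choice of the Euclidean norm are all assumptions made for illustration. For this class the supremum over w of sum_t eps_t <w, x_t> equals the Euclidean norm of sum_t eps_t x_t, so the complexity can be estimated by averaging that norm over random sign vectors.

```python
import math
import random


def empirical_rademacher_l2(xs, num_samples=2000, seed=0):
    """Monte Carlo estimate of E_eps || sum_t eps_t x_t ||_2.

    Illustrative helper (hypothetical name). For linear predictors with
    ||w||_2 <= 1, sup_w sum_t eps_t <w, x_t> is the dual norm of
    sum_t eps_t x_t, and the Euclidean norm is its own dual.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_samples):
        acc = [0.0] * dim
        for x in xs:
            eps = rng.choice((-1.0, 1.0))  # one Rademacher sign per data point
            for i in range(dim):
                acc[i] += eps * x[i]
        total += math.sqrt(sum(v * v for v in acc))
    return total / num_samples


def jensen_upper_bound_l2(xs):
    """Upper bound sqrt(sum_t ||x_t||_2^2) obtained from Jensen's inequality."""
    return math.sqrt(sum(sum(v * v for v in x) for x in xs))


if __name__ == "__main__":
    data = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, -1.0]]
    print("estimate:", empirical_rademacher_l2(data))
    print("upper bound:", jensen_upper_bound_l2(data))
```

Multiplying every data point by a constant c > 0 multiplies both quantities by c, which is the sense in which a bound stated in terms of this complexity adapts to the scale of the data.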
