Follow the leader if you can, hedge if you must

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has poor performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As a stepping stone for our analysis, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour, and Stoltz (2007), yielding improved worst-case guarantees. By interleaving AdaHedge and FTL, FlipFlop achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.
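For intuition, here is a minimal Python sketch of the two strategies that FlipFlop interleaves: Hedge, which averages experts using exponential weights at a fixed learning rate eta, and FTL, which corresponds to the limit of an infinitely large learning rate. The function names and toy data below are ours for illustration only; AdaHedge's dynamic tuning of eta and FlipFlop's rule for switching between the two regimes are specified in the paper and are not reproduced here.

```python
import numpy as np

def hedge_weights(cum_loss, eta):
    # Exponential (Hedge) weights for experts with cumulative losses cum_loss;
    # subtracting the minimum before exponentiating avoids numerical underflow.
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return w / w.sum()

def ftl_weights(cum_loss):
    # Follow-the-Leader: all mass on the expert(s) with the smallest
    # cumulative loss so far (the eta -> infinity limit of Hedge).
    leaders = (cum_loss == cum_loss.min()).astype(float)
    return leaders / leaders.sum()

# Toy run on i.i.d. ("stochastic") losses, where expert 0 is better on
# average; in this easy regime FTL typically incurs only constant regret.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(1000, 2)) + np.array([0.0, 0.1])
cum_loss = np.zeros(2)
hedge_total = ftl_total = 0.0
for loss in losses:
    hedge_total += hedge_weights(cum_loss, eta=1.0) @ loss
    ftl_total += ftl_weights(cum_loss) @ loss
    cum_loss += loss
best = cum_loss.min()
print(f"regret: Hedge {hedge_total - best:.2f}, FTL {ftl_total - best:.2f}")
```

On worst-case (adversarial) loss sequences the comparison reverses: FTL can be forced to suffer regret growing linearly in the number of rounds, which is what motivates hedging and, ultimately, the interleaving performed by FlipFlop.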

[1] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. In STOC, 1993.

[2] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[3] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.

[4] Vladimir Vovk. A game of prediction with expert advice. In COLT, 1995.

[5] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

[6] Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

[7] Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.

[8] Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.

[9] Sham M. Kakade, Matthias W. Seeger, and Dean P. Foster. Worst-case bounds for Gaussian process models. In NIPS, 2005.

[10] Vladimir Vovk, Akimichi Takemura, and Glenn Shafer. Defensive forecasting. In AISTATS, 2005.

[11] Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639–660, 2005.

[12] Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8):1228–1244, 2008.

[13] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[14] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.

[15] Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.

[16] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[17] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2–3):321–352, 2007.

[18] Peter Grünwald and John Langford. Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning, 66(2–3):119–149, 2007.

[19] László Györfi and György Ottucsák. Sequential prediction of unbounded stationary time series. IEEE Transactions on Information Theory, 53(5):1866–1872, 2007.

[20] Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In NIPS, 2009.

[21] Alexey Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. In UAI, 2010.

[22] Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In COLT, 2010.

[23] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2–3):165–188, 2010.

[24] Peter Grünwald. Safe learning: Bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In COLT, 2011.

[25] Tim van Erven, Peter Grünwald, Wouter M. Koolen, and Steven de Rooij. Adaptive Hedge. In NIPS, 2011.

[26] Sébastien Gerchinovitz. Prediction of individual sequences and the classical statistical framework: a study of some links around sparse regression and aggregation techniques. PhD thesis, Université Paris-Sud, 2011 (in French).

[27] Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts. Machine Learning, 90(2):231–260, 2013.

[28] Peter Grünwald. The Safe Bayesian: Learning the learning rate via the mixability gap. In ALT, 2012.
