Paper information - Follow the Leader with Dropout Perturbations

Follow the Leader with Dropout Perturbations

We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of $K$ experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”), where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss $L^*$ of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of $L^*$. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout (because experts may be interpreted as features), although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case $O(\sqrt{L^* \ln K} + \ln K)$ regret as a function of $L^*$, and optimal $O(\ln K)$ regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts results in different noise distributions for each expert (because they depend on the expert’s total loss) and, curiously enough, no additional tuning is needed: the choice of dropout probability only affects the constants.
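The procedure described in the abstract can be illustrated with a minimal sketch for binary losses. It is not the authors' implementation. It assumes that "dropout" means each past unit loss of an expert is independently reset to 0 with probability `alpha`, and that the dropout mask is redrawn on every trial. The names `dropout_leader`, `run` and `alpha` are hypothetical and chosen for illustration only.

```python
import random


def dropout_leader(cum_losses, alpha=0.5, rng=random):
    """Return the index of the expert with the smallest perturbed loss.

    cum_losses[k] counts the past trials on which expert k had loss 1.
    Assumption: each of those unit losses is dropped (set to 0) with
    probability alpha, so the perturbed loss of expert k is a
    Binomial(cum_losses[k], 1 - alpha) draw.
    """
    perturbed = [
        sum(1 for _ in range(count) if rng.random() >= alpha)
        for count in cum_losses
    ]
    best = min(perturbed)
    # Ties between leaders are broken uniformly at random.
    leaders = [k for k, value in enumerate(perturbed) if value == best]
    return rng.choice(leaders)


def run(loss_vectors, alpha=0.5, seed=0):
    """Play the sketch on a sequence of binary loss vectors.

    Returns (algorithm_loss, regret), where regret is measured against
    the best single expert in hindsight.
    """
    rng = random.Random(seed)
    num_experts = len(loss_vectors[0])
    cum_losses = [0] * num_experts
    algorithm_loss = 0
    for losses in loss_vectors:
        chosen = dropout_leader(cum_losses, alpha, rng)
        algorithm_loss += losses[chosen]
        # All expert losses are revealed after the prediction is made.
        for k, loss in enumerate(losses):
            cum_losses[k] += loss
    return algorithm_loss, algorithm_loss - min(cum_losses)


if __name__ == "__main__":
    # Toy data: expert 0 never errs, expert 1 always errs.
    trials = [[0, 1]] * 200
    print(run(trials, alpha=0.5))
```

In this sketch an expert with a larger cumulative loss receives perturbed losses with larger variance, which matches the abstract's remark that the noise distribution differs per expert. Setting `alpha=0` reduces the sketch to plain Follow the Leader with random tie-breaking.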

Wojciech Kotłowski | Tim van Erven
