Learning to Use Learners' Advice

In this paper, we study a variant of the framework of online learning using expert advice with limited/bandit feedback. We treat each expert as a learning entity, seeking to more accurately reflect certain real-world applications. In our setting, the feedback at any time $t$ is limited in the sense that it is only available to the expert $i^t$ that has been selected by the central algorithm (forecaster), \emph{i.e.}, only the expert $i^t$ receives feedback from the environment and gets to learn at time $t$. We consider a generic black-box approach whereby the forecaster does not control or know the learning dynamics of the experts apart from knowing the following no-regret learning property: the average regret of any expert $j$ vanishes at a rate of at least $O(t_j^{\regretRate-1})$ with $t_j$ learning steps, where $\regretRate \in [0, 1]$ is a parameter. In the spirit of competing against the best action in hindsight in the multi-armed bandit problem, our goal here is to be competitive w.r.t. the cumulative loss the algorithm could incur by following the policy of always selecting one expert. We prove the following hardness result: without any coordination between the forecaster and the experts, it is impossible to design a forecaster achieving no-regret guarantees. In order to circumvent this hardness result, we consider a practical assumption allowing the forecaster to "guide" the learning process of the experts by filtering/blocking some of the feedback they observe from the environment, \emph{i.e.}, by not allowing the selected expert $i^t$ to learn at some time steps $t$. Then, we design a novel no-regret learning algorithm \algo for this problem setting by carefully guiding the feedback observed by the experts. We prove that \algo achieves a worst-case expected cumulative regret of $O\big(\Time^{\frac{1}{2 - \regretRate}}\big)$ after $\Time$ time steps.
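The interaction protocol described above can be made concrete with a minimal sketch. The snippet below is only an illustration of the setting, not the algorithm \algo: the `Expert` class, `run_protocol`, `allow_feedback`, the uniform-random selection rule, and the toy environment are hypothetical placeholders introduced solely to show where selection, loss observation, and feedback blocking occur.

```python
import random

class Expert:
    """Placeholder no-regret learner: it updates only on feedback the forecaster forwards."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.steps_learned = 0            # t_j in the abstract

    def act(self):
        # Stand-in policy; a real expert would use its own learning rule.
        return random.randrange(self.n_actions)

    def update(self, action, loss):
        # A real expert's internal no-regret update would go here.
        self.steps_learned += 1

def run_protocol(experts, environment, horizon, allow_feedback):
    """Run the limited-feedback protocol.

    At each step t, the forecaster selects one expert i^t, plays its action,
    observes only that expert's loss, and then decides (via allow_feedback)
    whether to forward the feedback or block it, i.e., whether expert i^t
    gets to learn at time t.
    """
    total_loss = 0.0
    for t in range(horizon):
        i = random.randrange(len(experts))     # stand-in selection rule, not \algo
        action = experts[i].act()
        loss = environment(t, action)          # only expert i^t's loss is observed
        total_loss += loss
        if allow_feedback(t, i):               # forecaster may filter/block feedback
            experts[i].update(action, loss)
    return total_loss

if __name__ == "__main__":
    # Toy usage: two experts over 3 actions, a simple loss, and an arbitrary
    # guiding rule that blocks feedback on odd steps (not the rule used by \algo).
    experts = [Expert(3) for _ in range(2)]
    env = lambda t, a: float(a == t % 3)
    print(run_protocol(experts, env, horizon=100,
                       allow_feedback=lambda t, i: t % 2 == 0))
```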
