Multi-Armed Bandits With Combinatorial Strategies Under Stochastic Bandits

We consider the following combinatorial multi-armed bandit (MAB) problem with linear rewards. In a discrete-time system, there are K unknown random variables (RVs), i.e., arms, each evolving as an i.i.d. stochastic process over time. At each time slot, we select a set of N (N ≤ K) RVs, i.e., a strategy, subject to an arbitrary constraint. We then gain a reward that is a linear combination of the observations on the selected RVs. Our goal is to minimize the regret, defined as the difference between the cumulative reward obtained by an optimal static policy that knows the mean of each RV and that obtained by a given learning policy that does not. A prior result for this problem achieves zero regret (the expected regret per time slot approaches zero as time goes to infinity), but its bound depends on the probability distribution of the strategies generated by the learning policy: the regret becomes arbitrarily large as the gap between the rewards of the best and second-best strategies approaches zero. Meanwhile, when there is an exponential number of combinations, a naive extension of a prior distribution-free policy performs poorly in terms of regret, computation, and space complexity. We propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret without depending on the probability distribution of strategies. Our learning policy requires only O(K) time and space complexity. When the linear combination involves NP-hard problems, our policy provides a flexible scheme for choosing approximation algorithms that solve the problem efficiently while retaining zero regret.
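To make the setting concrete, the following is a minimal simulation sketch of the problem (not the paper's DFL policy): K Bernoulli arms with hypothetical means, a generic combinatorial UCB-style index policy that each slot selects the N arms with the highest indices (a simple cardinality constraint), a reward equal to the unit-weight linear combination of the selected observations, and the regret measured against the best static strategy computed from the true means. All parameter values are illustrative assumptions.

```python
import math
import random

def simulate(K=6, N=2, T=5000, seed=0):
    """Sketch of the combinatorial MAB setting with linear rewards.
    Policy: a generic UCB-style index rule (illustrative only, NOT the
    paper's DFL policy). Each slot we play the N arms with the largest
    indices and earn the sum of their Bernoulli observations."""
    rng = random.Random(seed)
    mu = [0.1 + 0.8 * i / (K - 1) for i in range(K)]  # hypothetical arm means
    counts = [0] * K      # number of observations per arm
    means = [0.0] * K     # empirical mean per arm
    total = 0.0           # cumulative realized reward of the learner
    for t in range(1, T + 1):
        # UCB index; an arm never observed gets priority via an infinite index
        ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] else float('inf') for i in range(K)]
        # strategy = the N arms with the largest indices (cardinality constraint)
        chosen = sorted(range(K), key=lambda i: ucb[i], reverse=True)[:N]
        for i in chosen:
            x = 1.0 if rng.random() < mu[i] else 0.0
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
            total += x
    # optimal static strategy: the N arms with the highest true means
    best = sum(sorted(mu, reverse=True)[:N]) * T
    return best - total   # cumulative regret over T slots

regret = simulate()
```

Under this sketch, the cumulative regret grows only logarithmically in T for the cardinality constraint, so the per-slot regret vanishes; the paper's contribution is achieving this behavior distribution-free, with O(K) time and space, and under arbitrary (possibly NP-hard) combinatorial constraints.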
