Combinatorial Multi-Armed Bandit with General Reward Functions

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the max() function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve O(log T) distribution-dependent regret and O(√T) distribution-independent regret, where T is the time horizon. We apply our results to the K-MAX problem and expected utility maximization problems. In particular, for K-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first O(√T) bound on the (1 − ε)-approximation regret of its online problem, for any ε > 0.
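
The idea of a stochastically dominant confidence bound can be illustrated with a small sketch. It is not the paper's implementation: it assumes outcomes lie in [0, 1], uses a DKW-style radius sqrt(3 ln t / (2 n)) chosen here for illustration, and the helper names (`empirical_cdf`, `dominant_cdf`, `mean_from_cdf`) are hypothetical. Lowering the empirical CDF by the radius, while keeping the value 1 at the top of the support, gives a distribution that first-order stochastically dominates the empirical one, which plays the role that an upper confidence bound on the mean plays in UCB.

```python
import math
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def dominant_cdf(samples, t, upper=1.0):
    """Lower the empirical CDF by a confidence radius, clipped at 0.

    Assumption: the radius sqrt(3 ln t / (2 n)) is an illustrative
    DKW-style choice. The CDF is forced to 1 at `upper`, the assumed
    top of the support, so the result is still a valid CDF.
    """
    f_hat = empirical_cdf(samples)
    radius = math.sqrt(3.0 * math.log(t) / (2.0 * len(samples)))

    def cdf(x):
        if x >= upper:
            return 1.0
        return max(f_hat(x) - radius, 0.0)

    return cdf


def mean_from_cdf(cdf, grid):
    """Mean of a distribution whose mass sits on the sorted points in `grid`."""
    total, previous = 0.0, 0.0
    for point in grid:
        current = cdf(point)
        total += point * (current - previous)  # mass placed at this point
        previous = current
    return total


# Hypothetical observations of one arm after t = 10 rounds.
samples = [0.2, 0.4, 0.4, 0.8]
grid = sorted(set(samples) | {1.0})
f_hat = empirical_cdf(samples)
f_low = dominant_cdf(samples, t=10)

empirical_mean = mean_from_cdf(f_hat, grid)
optimistic_mean = mean_from_cdf(f_low, grid)
print(empirical_mean, optimistic_mean)
```

Because the lowered CDF moves probability mass toward larger outcomes, any reward that is monotone in the outcomes (the mean here, or a max over several arms) evaluates at least as high under it as under the empirical distribution.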

Wei Chen | Yu Liu | Wei Hu | Jian Li | Fu Li | Pinyan Lu
