论文信息 - Multi-armed bandit problems with dependent arms

Multi-armed bandit problems with dependent arms

We provide a framework to exploit dependencies among arms in multi-armed bandit problems, when the dependencies are in the form of a generative model on clusters of arms. We find an optimal MDP-based policy for the discounted reward case, and also give an approximation of it with formal error guarantee. We discuss lower bounds on regret in the undiscounted reward scenario, and propose a general two-level bandit policy for it. We propose three different instantiations of our general policy and provide theoretical justifications of how the regret of the instantiated policies depend on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, and show that they significantly outperform classical policies designed for bandits with independent arms.

[1] J. Gittins. Bandit processes and dynamic allocation indices , 1979 .

[2] P. Whittle. Multi‐Armed Bandits and the Gittins Index , 1980 .

[3] T. Lai. Adaptive treatment allocation and the multi-armed bandit problem , 1987 .

[4] T. Lai,et al. Optimal stopping and dynamic allocation , 1987, Advances in Applied Probability.

[5] David J. C. MacKay,et al. Information-Based Objective Functions for Active Data Selection , 1992, Neural Computation.

[6] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[7] R. Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.

[8] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[9] H. Vincent Poor,et al. Bandit problems with side observations , 2005, IEEE Transactions on Automatic Control.

[10] Sanjoy Dasgupta,et al. Coarse sample complexity bounds for active learning , 2005, NIPS.

[11] Csaba Szepesvári,et al. Bandit Based Monte-Carlo Planning , 2006, ECML.

[12] Deepayan Chakrabarti,et al. Bandits for Taxonomies: A Model-based Approach , 2007, SDM.

[13] T. L. Lai Andherbertrobbins. Asymptotically Efficient Adaptive Allocation Rules , 2022 .