Multi-armed Bandits with Constrained Arms and Hidden States

This paper considers rested and restless multi-armed bandits in which the availability of arms is constrained. The state of each arm evolves in a Markovian manner, and the exact states are hidden from the decision maker. First, structural results on the value functions are presented. From these results it follows that the optimal policy is a \textit{threshold policy}. Further, \textit{indexability} of rested bandits is established and an index formula is derived. The performance of the index policy is illustrated and compared with that of the myopic policy using numerical examples.
