Markov Game with Switching Costs

We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance, with the objective of minimizing the expected cost for at least k chains to reach their target states. If the player decides to play a different chain, an additional switching cost is incurred. The special case in which there is no switching cost was solved optimally by Dumitriu, Tetali and Winkler [DTW03] by a variant of the celebrated Gittins index for the classical multi-armed bandit (MAB) problem with Markovian rewards [Git74, Git79]. However, for MAB with nontrivial switching costs, even if the switching cost is a constant, the classic paper by Banks and Sundaram [BS94] showed that no index strategy can be optimal.¹ In this paper, we complement their result and show that there is a simple index strategy that achieves a constant approximation factor if the switching cost is constant and k = 1. To the best of our knowledge, this is the first index strategy that achieves a constant approximation factor for a general MAB variant with switching costs. For the general metric, we propose a more involved constant-factor approximation algorithm, via a nontrivial reduction to the stochastic k-TSP problem, in which a Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.

∗ Institute for Interdisciplinary Information Sciences, Tsinghua University. Email: lijian83@mail.tsinghua.edu.cn.
† Paul G. Allen School of Computer Science & Engineering, University of Washington. Part of this work was done while visiting Shanghai Qi Zhi Institute. Email: dgliu@cs.washington.edu.
¹ Their proof is for the discounted version of MAB, but can be extended to our setting. See Appendix D for the details.

arXiv:2107.05822v1 [cs.DS] 13 Jul 2021
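
As a rough illustration of the game described in the abstract, the following Python sketch simulates one episode under a given policy and returns the total cost paid. The chain encoding (`start`, `target`, `move_cost`, `P`), the per-state movement costs, the assumption that the first pick incurs no switching cost, and the names `play_episode`, `lowest_index_policy` and `constant_switch_cost` are all hypothetical choices made for this example and are not taken from the paper. In particular, the policy shown is a naive placeholder, not the index strategy the paper analyzes.

```python
import random


def play_episode(chains, policy, switch_cost, k=1, seed=0, max_rounds=10_000):
    """Simulate one episode and return the total cost (movement + switching).

    chains: list of dicts with keys
        "start", "target": state labels
        "move_cost": {state: cost of advancing the chain from that state}
        "P": {state: {next_state: probability}}
    policy(states, finished, current) -> index of an unfinished chain to advance
    switch_cost(i, j) -> cost of moving from chain i to chain j
    """
    rng = random.Random(seed)
    states = [c["start"] for c in chains]
    # A chain that starts at its target counts as already finished.
    finished = {i for i, c in enumerate(chains) if c["start"] == c["target"]}
    total = 0.0
    current = None  # assumption: the first chain picked is free of switching cost
    rounds = 0
    while len(finished) < k:
        if rounds >= max_rounds:
            raise RuntimeError("episode did not finish within max_rounds")
        rounds += 1
        i = policy(states, finished, current)
        if i in finished:
            raise ValueError("policy must pick an unfinished chain")
        if current is not None and i != current:
            total += switch_cost(current, i)
        current = i
        s = states[i]
        total += chains[i]["move_cost"][s]
        successors = list(chains[i]["P"][s].keys())
        weights = list(chains[i]["P"][s].values())
        states[i] = rng.choices(successors, weights=weights)[0]
        if states[i] == chains[i]["target"]:
            finished.add(i)
    return total


def lowest_index_policy(states, finished, current):
    """Placeholder policy: always advance the lowest-index unfinished chain."""
    return min(i for i in range(len(states)) if i not in finished)


def constant_switch_cost(i, j):
    """Constant switching cost, the special case mentioned in the abstract."""
    return 3.0


# Two deterministic toy chains, so the episode cost is easy to check by hand.
toy_chains = [
    {"start": 0, "target": 2,
     "move_cost": {0: 1.0, 1: 1.0},
     "P": {0: {1: 1.0}, 1: {2: 1.0}}},
    {"start": 0, "target": 1,
     "move_cost": {0: 5.0},
     "P": {0: {1: 1.0}}},
]

cost_k1 = play_episode(toy_chains, lowest_index_policy, constant_switch_cost, k=1)
cost_k2 = play_episode(toy_chains, lowest_index_policy, constant_switch_cost, k=2)
print(cost_k1, cost_k2)
```

With k = 1 the placeholder policy pays only the two movement costs of the first chain; with k = 2 it additionally pays one switching cost and the movement cost of the second chain. A metric switching cost could be modeled by replacing `constant_switch_cost` with a lookup into a distance table.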

[1]  Nikhil Bansal,et al.  On the adaptivity gap of stochastic orienteering , 2013, Mathematical Programming.

[2]  Ashish Goel,et al.  Improved approximation results for stochastic knapsack problems , 2011, SODA '11.

[3]  Roi Livni,et al.  Multi-Armed Bandits with Metric Movement Costs , 2017, NIPS.

[4]  Eli Upfal,et al.  Multi-Armed Bandits in Metric Spaces , 2008 .

[5]  W. Viscusi A Theory of Job Shopping: A Bayesian Perspective , 1980 .

[6]  Jian Li,et al.  Stochastic combinatorial optimization via poisson approximation , 2012, STOC '13.

[7]  T. Brenner,et al.  On the Behavior of Proposers in Ultimatum Games , 2003 .

[8]  Max-Olivier Hongler,et al.  Optimal hysteresis for a class of deterministic deteriorating two-armed Bandit problem with switching costs , 2003, Autom..

[9]  J. Banks,et al.  Switching Costs and the Gittins Index , 1994 .

[10]  Christian M. Ernst,et al.  Multi-armed Bandit Allocation Indices , 1989 .

[11]  William R. Johnson,et al.  A Theory of Job Shopping , 1978 .

[12]  Lones Smith,et al.  Optimal job search in a changing world , 1999 .

[13]  Michael Waldman,et al.  Job Assignments, Signalling, and Efficiency , 1984 .

[14]  Viswanath Nagarajan,et al.  Approximation Algorithms for Stochastic k-TSP , 2016, FSTTCS.

[15]  M. Rothschild A two-armed bandit theory of market pricing , 1974 .

[16]  Sahil Singla,et al.  The Price of Information in Combinatorial Optimization , 2017, SODA.

[17]  K. Schlag Why Imitate, and If So, How? A Boundedly Rational Approach to Multi-armed Bandits , 1998 .

[18]  Jan Vondrák,et al.  Approximating the stochastic knapsack problem: the benefit of adaptivity , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[19]  A. McLennan Price dispersion and incomplete learning in the long run , 1984 .

[20]  Rina Azoulay-Schwartz,et al.  Exploitation vs. exploration: choosing a supplier in an environment of incomplete information , 2004, Decis. Support Syst..

[21]  R. Weber On the Gittins Index for Multiarmed Bandits , 1992 .

[22]  Jun Yu Li,et al.  Approximation Algorithms for Stochastic Combinatorial Optimization Problems , 2016, Journal of the Operations Research Society of China.

[23]  J. Gittins Bandit processes and dynamic allocation indices , 1979 .

[24]  Kevin D. Glazebrook,et al.  Gittins Indices and Oil Exploration , 1992 .

[25]  Daniel Krähmer,et al.  Entry and experimentation in oligopolistic markets for experience goods , 2003 .

[26]  Essays on decision theory : effects of changes in environment on decisions , 2001 .

[27]  Demosthenis Teneketzis,et al.  Multi-armed bandits with switching penalties , 1996, IEEE Trans. Autom. Control..

[28]  John N. Tsitsiklis,et al.  The complexity of optimal queueing network control , 1994, Proceedings of IEEE 9th Annual Conference on Structure in Complexity Theory.

[29]  D. Freedman On Tail Probabilities for Martingales , 1975 .

[30]  Sudipto Guha,et al.  Multi-armed Bandits with Metric Switching Costs , 2009, ICALP.

[31]  Boyan Jovanovic,et al.  Matching, Turnover, and Unemployment , 1984, Journal of Political Economy.

[32]  Peter Winkler,et al.  On Playing Golf with Two Balls , 2003, SIAM J. Discret. Math..

[33]  Laura Doval,et al.  Whether or not to open Pandora's box , 2018, J. Econ. Theory.

[34]  P. Whittle Restless Bandits: Activity Allocation in a Changing World , 1988 .

[35]  Mikhail M. Klimenko,et al.  Industrial Targeting, Experimentation and Long-Run Specialization , 1998 .

[36]  Mark P. Van Oyen,et al.  Properties of Optimal-Weighted Flowtime Policies with a Makespan Constraint and Set-up Times , 2000, Manuf. Serv. Oper. Manag..

[37]  E. Glen Weyl,et al.  Descending Price Optimally Coordinates Search , 2016, EC.

[38]  Gideon Weiss,et al.  Four proofs of Gittins’ multiarmed bandit theorem , 2016, Ann. Oper. Res..

[39]  D. Bergemann,et al.  Stationary Multi Choice Bandit Problems , 2001 .

[40]  Haotian Jiang,et al.  Algorithms and Adaptivity Gaps for Stochastic k-TSP , 2019, ITCS.

[41]  R. Ravi,et al.  Approximation algorithms for stochastic orienteering , 2012, SODA.

[42]  M. Weitzman Optimal search for the best alternative , 1978 .

[43]  G. MacDonald Person-Specific Information in the Labor Market , 1980, Journal of Political Economy.

[44]  J. A. Bather,et al.  Oil exploration: sequential decisions in the face of uncertainty , 1988, Journal of Applied Probability.