Paper Information - Markov Game with Switching Costs

Markov Game with Switching Costs

We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance, with the objective of minimizing the expected cost for at least k chains to reach their target states. If the player decides to play a different chain, an additional switching cost is incurred. The special case in which there is no switching cost was solved optimally by Dumitriu, Tetali and Winkler [DTW03] by a variant of the celebrated Gittins index for the classical multi-armed bandit (MAB) problem with Markovian rewards [Git74, Git79]. However, for MAB with nontrivial switching cost, even if the switching cost is a constant, the classic paper by Banks and Sundaram [BS94] showed that no index strategy can be optimal.¹ In this paper, we complement their result and show that there is a simple index strategy that achieves a constant approximation factor if the switching cost is constant and k = 1. To the best of our knowledge, this is the first index strategy that achieves a constant approximation factor for a general MAB variant with switching costs. For the general metric, we propose a more involved constant-factor approximation algorithm, via a nontrivial reduction to the stochastic k-TSP problem, in which a Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.

¹ Their proof is for the discounted version of MAB, but can be extended to our setting. See Appendix D for the details.

∗ Institute for Interdisciplinary Information Sciences, Tsinghua University. Email: lijian83@mail.tsinghua.edu.cn.
† Paul G. Allen School of Computer Science & Engineering, University of Washington. Part of this work was done while visiting Shanghai Qi Zhi Institute. Email: dgliu@cs.washington.edu.

arXiv:2107.05822v1 [cs.DS] 13 Jul 2021
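To make the cost model in the abstract concrete, the following is a minimal simulation sketch, not the paper's algorithm. It assumes a constant switching cost, a unit cost per move, chains that all start in state 0, and a free first choice of chain; these conventions and the names `simulate`, `stay_policy`, `alternate_policy`, and `LINE` are hypothetical and do not come from the paper.

```python
import random

def simulate(chains, targets, policy, switch_cost, k=1, rng=None, max_rounds=10_000):
    """Play one episode and return the total cost (moves plus switches).

    chains[i] is a row-stochastic transition matrix (list of lists) for chain i.
    Assumptions (not from the paper): unit cost per move, every chain starts in
    state 0, and the very first choice of chain is free of switching cost.
    """
    rng = rng or random.Random(0)
    states = [0] * len(chains)
    done = set()          # indices of chains that have reached their target
    current = None        # chain played in the previous round
    total = 0.0
    for _ in range(max_rounds):
        if len(done) >= k:
            return total
        i = policy(states, done, current)
        if current is not None and i != current:
            total += switch_cost      # constant switching cost
        current = i
        total += 1.0                  # unit move cost
        row = chains[i][states[i]]
        states[i] = rng.choices(range(len(row)), weights=row)[0]
        if states[i] == targets[i]:
            done.add(i)
    raise RuntimeError("episode did not finish within max_rounds")

def stay_policy(states, done, current):
    """Always advance chain 0."""
    return 0

def alternate_policy(states, done, current):
    """Alternate between chain 0 and chain 1, starting with chain 0."""
    return 0 if current != 0 else 1

# A deterministic three-state chain 0 -> 1 -> 2 whose target is state 2.
LINE = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

stay_cost = simulate([LINE, LINE], [2, 2], stay_policy, switch_cost=3.0)
alt_cost = simulate([LINE, LINE], [2, 2], alternate_policy, switch_cost=3.0)
print(stay_cost, alt_cost)
```

With two identical deterministic chains and k = 1, staying on one chain pays only for its two moves, while alternating pays for three moves and two switches. This toy instance only illustrates why switching costs penalize strategies that change chains often.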

Daogao Liu | Jian Li

[1] Nikhil Bansal,et al. On the adaptivity gap of stochastic orienteering , 2013, Mathematical Programming.

[2] Ashish Goel,et al. Improved approximation results for stochastic knapsack problems , 2011, SODA '11.

[3] Roi Livni,et al. Multi-Armed Bandits with Metric Movement Costs , 2017, NIPS.

[4] Eli Upfal,et al. Multi-Armed Bandits in Metric Spaces , 2008 .

[5] W. Viscusi. A Theory of Job Shopping: A Bayesian Perspective , 1980 .

[6] Jian Li,et al. Stochastic combinatorial optimization via poisson approximation , 2012, STOC '13.

[7] T. Brenner,et al. On the Behavior of Proposers in Ultimatum Games , 2003 .

[8] Max-Olivier Hongler,et al. Optimal hysteresis for a class of deterministic deteriorating two-armed Bandit problem with switching costs , 2003, Autom..

[9] J. Banks,et al. Switching Costs and the Gittins Index , 1994 .

[10] Christian M. Ernst,et al. Multi-armed Bandit Allocation Indices , 1989 .

[11] William R. Johnson,et al. A Theory of Job Shopping , 1978 .

[12] Lones Smith,et al. Optimal job search in a changing world , 1999 .

[13] Michael Waldman,et al. Job Assignments, Signalling, and Efficiency , 1984 .

[14] Viswanath Nagarajan,et al. Approximation Algorithms for Stochastic k-TSP , 2016, FSTTCS.

[15] M. Rothschild. A two-armed bandit theory of market pricing , 1974 .

[16] Sahil Singla,et al. The Price of Information in Combinatorial Optimization , 2017, SODA.

[17] K. Schlag. Why Imitate, and If So, How?: A Boundedly Rational Approach to Multi-armed Bandits , 1998 .

[18] Jan Vondrák,et al. Approximating the stochastic knapsack problem: the benefit of adaptivity , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[19] A. McLennan. Price dispersion and incomplete learning in the long run , 1984 .

[20] Rina Azoulay-Schwartz,et al. Exploitation vs. exploration: choosing a supplier in an environment of incomplete information , 2004, Decis. Support Syst..

[21] R. Weber. On the Gittins Index for Multiarmed Bandits , 1992 .

[22] Jun Yu Li,et al. Approximation Algorithms for Stochastic Combinatorial Optimization Problems , 2016, Journal of the Operations Research Society of China.

[23] J. Gittins. Bandit processes and dynamic allocation indices , 1979 .

[24] Kevin D. Glazebrook,et al. Gittins Indices and Oil Exploration , 1992 .

[25] Daniel Krähmer,et al. Entry and experimentation in oligopolistic markets for experience goods , 2003 .

[26] Essays on decision theory : effects of changes in environment on decisions , 2001 .

[27] Demosthenis Teneketzis,et al. Multi-armed bandits with switching penalties , 1996, IEEE Trans. Autom. Control..

[28] John N. Tsitsiklis,et al. The complexity of optimal queueing network control , 1994, Proceedings of IEEE 9th Annual Conference on Structure in Complexity Theory.

[29] D. Freedman. On Tail Probabilities for Martingales , 1975 .

[30] Sudipto Guha,et al. Multi-armed Bandits with Metric Switching Costs , 2009, ICALP.

[31] Boyan Jovanovic,et al. Matching, Turnover, and Unemployment , 1984, Journal of Political Economy.

[32] Peter Winkler,et al. On Playing Golf with Two Balls , 2003, SIAM J. Discret. Math..

[33] Laura Doval,et al. Whether or not to open Pandora's box , 2018, J. Econ. Theory.

[34] P. Whittle. Restless Bandits: Activity Allocation in a Changing World , 1988 .

[35] Mikhail M. Klimenko,et al. Industrial Targeting, Experimentation and Long-Run Specialization , 1998 .

[36] Mark P. Van Oyen,et al. Properties of Optimal-Weighted Flowtime Policies with a Makespan Constraint and Set-up Times , 2000, Manuf. Serv. Oper. Manag..

[37] E. Glen Weyl,et al. Descending Price Optimally Coordinates Search , 2016, EC.

[38] Gideon Weiss,et al. Four proofs of Gittins’ multiarmed bandit theorem , 2016, Ann. Oper. Res..

[39] D. Bergemann,et al. Stationary Multi Choice Bandit Problems , 2001 .

[40] Haotian Jiang,et al. Algorithms and Adaptivity Gaps for Stochastic k-TSP , 2019, ITCS.

[41] R. Ravi,et al. Approximation algorithms for stochastic orienteering , 2012, SODA.

[42] M. Weitzman. Optimal search for the best alternative , 1978 .

[43] G. MacDonald. Person-Specific Information in the Labor Market , 1980, Journal of Political Economy.

[44] J. A. Bather,et al. Oil exploration: sequential decisions in the face of uncertainty , 1988, Journal of Applied Probability.