Q-learning for Markov decision processes with a satisfiability criterion

Abstract: A reinforcement learning algorithm is proposed for solving a multi-criterion Markov decision process (MDP), i.e., an MDP with a vector-valued running cost. The algorithm combines a Q-learning scheme for a weighted linear combination of the prescribed running costs with an incremental version of replicator dynamics that updates the weights. The objective is for the time-averaged vector cost to meet prescribed asymptotic bounds. Under mild assumptions, the scheme is shown to achieve this objective.
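The two coupled updates described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's exact scheme. The function names, the step sizes `alpha` and `beta`, the discount factor `gamma` (used here only for simplicity, whereas the abstract concerns time-averaged costs), and the choice of "average cost minus bound" as the replicator payoff are all assumptions made for illustration.

```python
import numpy as np

def q_step(Q, s, a, cost_vec, s_next, w, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update for the scalarized cost w . cost_vec.

    Costs are minimized, so the bootstrap term uses the min over actions.
    The discounted form is an illustrative simplification (assumption).
    """
    target = float(w @ cost_vec) + gamma * Q[s_next].min()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def replicator_step(w, avg_cost, bounds, beta=0.01):
    """One incremental replicator-dynamics step on the weight simplex.

    The payoff of criterion i is taken to be its current constraint
    violation (assumption), so weight shifts toward violated bounds.
    """
    payoff = avg_cost - bounds                 # positive where a bound is violated
    w = w + beta * w * (payoff - w @ payoff)   # replicator increment, sums to zero
    w = np.clip(w, 1e-12, None)                # numerical safeguard: stay positive
    return w / w.sum()                         # renormalize onto the simplex

# Hypothetical usage: 2 states, 2 actions, 2 cost criteria.
Q = np.zeros((2, 2))
w = np.array([0.5, 0.5])
Q = q_step(Q, s=0, a=1, cost_vec=np.array([2.0, 0.0]), s_next=1, w=w)
w_new = replicator_step(w, avg_cost=np.array([2.0, 0.0]), bounds=np.array([1.0, 1.0]))
```

In this sketch the weights would typically be updated with a smaller step size than the Q-values, so that Q-learning sees nearly frozen weights; that separation of time scales is a design choice of the sketch, not a claim about the paper.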
