Decentralized Learning in Finite Markov Chains: Revisited

The convergence proof in the paper “Decentralized learning in finite Markov chains,” published in the IEEE Transactions on Automatic Control, vol. AC-31, no. 6, pp. 519-526, 1986, is incomplete. This note first provides a sufficient condition for the existence of a unique optimal policy for infinite-horizon average-cost Markov decision processes (MDPs), under which the convergence result established by Wheeler and Narendra is preserved. We then present a novel simulation-based decentralized algorithm for average-cost MDPs, called “sampled joint-strategy fictitious play for MDP,” based on a recent study by García et al. of a decentralized approach to discrete optimization via fictitious play applied to games with identical payoffs. We establish a stronger almost-sure convergence result than Wheeler and Narendra's: the sequence of probability distributions over the policy space of a given MDP generated by the algorithm converges to the unique optimal policy with probability one.
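
To convey the flavor of such an approach, the sketch below runs a sampled fictitious-play iteration on a randomly generated average-cost MDP, treating each state as a player whose common payoff is the long-run average cost. This is only a minimal illustrative sketch, not the paper's exact joint-strategy algorithm; all problem data and names (P, c, average_cost, sampled_fictitious_play) are assumptions introduced for illustration.

```python
import numpy as np

# Illustrative toy problem (all data here is made up): a small
# average-cost MDP. P[a, s, s'] is the transition probability of
# moving from state s to s' under action a; c[s, a] is the one-stage cost.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
c = rng.uniform(size=(n_states, n_actions))

def average_cost(policy):
    """Long-run average cost of a stationary deterministic policy,
    computed from the stationary distribution of the induced chain
    (irreducible here because all transition probabilities are positive)."""
    Ppi = np.array([P[policy[s], s] for s in range(n_states)])
    evals, evecs = np.linalg.eig(Ppi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = np.abs(mu) / np.abs(mu).sum()
    return float(sum(mu[s] * c[s, policy[s]] for s in range(n_states)))

def sampled_fictitious_play(n_iters=500):
    """Each state acts as a 'player' keeping an empirical frequency
    (belief) over its own actions. At every iteration a joint policy is
    sampled from the beliefs, and each player best-responds to the
    actions sampled for the other states."""
    counts = np.ones((n_states, n_actions))        # empirical action counts
    for _ in range(n_iters):
        beliefs = counts / counts.sum(axis=1, keepdims=True)
        sampled = np.array([rng.choice(n_actions, p=beliefs[s])
                            for s in range(n_states)])
        for s in range(n_states):                  # best response of player s
            trial = sampled.copy()
            vals = []
            for a in range(n_actions):
                trial[s] = a
                vals.append(average_cost(trial))
            counts[s, int(np.argmin(vals))] += 1   # update player's frequencies
    return counts.argmax(axis=1)                   # most frequently chosen actions

print("policy from sampled fictitious play:", sampled_fictitious_play())
```

Because all players share the same payoff, the best-response updates exploit the identical-interest game structure underlying the fictitious-play results in [3], [12], and [15].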

[1] Apostolos Burnetas et al., Computing Optimal Policies for Markovian Decision Processes Using Simulation, 1995.

[2] S. Marcus et al., Approximate receding horizon approach for Markov decision processes: average reward case, 2003.

[3] Alfredo García et al., A Decentralized Approach to Discrete Optimization via Simulation: Application to Network Flow, 2007, Oper. Res.

[4] Hyeong Soo Chang, Finite-Step Approximation Error Bounds for Solving Average-Reward-Controlled Markov Set-Chains, 2008, IEEE Transactions on Automatic Control.

[5] James E. Smith et al., Structural Properties of Stochastic Dynamic Programs, 2002, Oper. Res.

[6] O. Hernández-Lerma et al., A forecast horizon and a stopping rule for general Markov decision processes, 1988.

[7] John N. Tsitsiklis et al., Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[8] K. Narendra et al., Decentralized learning in finite Markov chains, 1985, 24th IEEE Conference on Decision and Control.

[9] Raúl Montes-de-Oca et al., Conditions for the uniqueness of optimal policies of discounted Markov decision processes, 2004, Math. Methods Oper. Res.

[10] Donald M. Topkis, Minimizing a Submodular Function on a Lattice, 1978, Oper. Res.

[11] Apostolos Burnetas et al., On confidence intervals from simulation of finite Markov chains, 1997, Math. Methods Oper. Res.

[12] Robert L. Smith et al., A Fictitious Play Approach to Large-Scale Optimization, 2005, Oper. Res.

[13] O. Hernández-Lerma, Adaptive Markov Control Processes, 1989.

[14] R. Weber et al., Optimal control of service rates in networks of queues, 1987, Advances in Applied Probability.

[15] L. Shapley et al., Fictitious Play Property for Games with Identical Interests, 1996.

[16] David D. Yao et al., Monotone Optimal Control of Permutable GSMPs, 1994, Math. Oper. Res.

[17] Alfredo García et al., A Game-Theoretic Approach to Efficient Power Management in Sensor Networks, 2008, Oper. Res.

[18] R. Amir, Supermodularity and Complementarity in Economics: An Elementary Survey, 2003.

[19] William L. Cooper et al., Convergence of Simulation-Based Policy Iteration, 2003, Probability in the Engineering and Informational Sciences.

[20] Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[21] Xi-Ren Cao et al., A unified approach to Markov decision problems and performance sensitivity analysis, 2000, at - Automatisierungstechnik.

[22] D. M. Topkis, Supermodularity and Complementarity, 1998.

[23] R. Amir et al., A Lattice-Theoretic Approach to a Class of Dynamic Games, 1989.

[24] J. Robinson, An Iterative Method of Solving a Game, 1951, Classics in Game Theory.

[25] Richard L. Tweedie, Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.