Continuous-time Markov decision processes with average reward: a reinforcement learning method

The Markov decision process (MDP) is a foundational framework for reinforcement learning in sequential decision problems. The continuous-time Markov decision process (CTMDP) extends the discrete-time MDP by allowing state transitions to occur at any point in continuous time. Prior work has given little consideration to reinforcement learning methods for solving CTMDPs. This article presents a reinforcement learning approach based on sample paths. Building on the key concept of the performance potential function, we present a policy iteration algorithm for the average-reward criterion. We then derive, via the Robbins-Monro method, a temporal difference formula for estimating the performance potential function from sample paths. Simulation results indicate that the proposed algorithms converge to the solution of the CTMDP problem at a reasonable speed.
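As a concrete illustration of the ideas above (not the article's own implementation, which works from sample paths), the sketch below uniformizes a small, randomly generated CTMDP into an equivalent discrete-time chain and runs average-reward policy iteration using performance potentials. All sizes, rates, and reward values are illustrative assumptions; uniformization preserves the stationary distribution, so using reward rates directly as per-step rewards recovers the continuous-time average reward.

```python
import numpy as np

# Illustrative 3-state, 2-action CTMDP with random transition rates and
# reward rates (assumed for the example; not taken from the article).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
q = rng.uniform(0.1, 1.0, size=(n_states, n_actions, n_states))
for i in range(n_states):
    q[i, :, i] = 0.0                      # no self-transition rates
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))  # reward rates

# Uniformization: embed the CTMDP into a discrete-time MDP with a
# common event rate Lam no smaller than every exit rate.
Lam = q.sum(axis=2).max() * 1.05
P = q / Lam
for i in range(n_states):
    P[i, :, i] = 1.0 - q[i].sum(axis=1) / Lam  # self-loop probabilities

def evaluate(policy):
    """Average reward eta and performance potentials g for a
    deterministic policy: solve (I - P_pi) g + eta * 1 = r_pi
    with the reference condition g[0] = 0."""
    Ppi = P[np.arange(n_states), policy]
    rpi = r[np.arange(n_states), policy]
    A = np.zeros((n_states + 1, n_states + 1))
    A[:n_states, :n_states] = np.eye(n_states) - Ppi
    A[:n_states, n_states] = 1.0          # column multiplying eta
    A[n_states, 0] = 1.0                  # pin g[0] = 0
    b = np.append(rpi, 0.0)
    sol = np.linalg.solve(A, b)
    return sol[n_states], sol[:n_states]  # eta, g

# Policy iteration: evaluate potentials, then improve greedily.
policy = np.zeros(n_states, dtype=int)
for _ in range(50):
    eta, g = evaluate(policy)
    new_policy = np.argmax(r + P @ g, axis=1)  # one-step lookahead on g
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
eta, g = evaluate(policy)
print("optimal average reward:", eta)
```

The article's temporal difference formula would replace the exact linear solve in `evaluate` with a Robbins-Monro stochastic approximation of `g` along a simulated sample path; the model-based solve is used here only to keep the sketch short and verifiable.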