Paper information - Model-building semi-Markov adaptive critics

Model-building semi-Markov adaptive critics

Adaptive critics, also called actor-critics, are a class of reinforcement learning (RL) or approximate dynamic programming (ADP) algorithms that search over stochastic policies in order to determine the optimal deterministic policy. Classically, these algorithms have been studied for Markov decision processes (MDPs) with model-free updates, which avoid transition probabilities altogether. Gosavi [1] proposed a model-free version for the discounted-reward semi-MDP (SMDP), in which the time of each transition can be a random variable. In this paper, we propose a variant in which the transition probability model is built simultaneously with the value function and the action-probability functions. Our new algorithm does not require the transition probabilities a priori; it estimates them alongside the value function and the action-probability functions required in adaptive critics. Model-building and model-based algorithms have numerous advantages over their model-free counterparts. In particular, they are more stable and may require less training. However, the additional steps of building the model may require increased storage in the computer's memory. In addition to enumerating potential application areas for our algorithm, we analyze the advantages and disadvantages of model building.
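To make the idea of building the model alongside the critic and the actor more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the update rule of the paper, which the abstract does not give. The class `SMDPModel`, the functions `train` and `toy_simulate`, the softmax action selection, the step sizes, and the two-state toy problem are all hypothetical. The sketch also assumes that the reward is earned as a lump sum at the start of a transition and that the transition time does not depend on the next state.

```python
import math
import random
from collections import defaultdict


class SMDPModel:
    """Running estimates of an SMDP model, built from observed transitions."""

    def __init__(self, discount_rate):
        self.discount_rate = discount_rate
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.visits = defaultdict(int)           # (s, a) -> number of visits
        self.mean_reward = defaultdict(float)    # (s, a) -> average reward
        self.mean_discount = defaultdict(float)  # (s, a) -> average exp(-rate * tau)

    def update(self, s, a, s_next, reward, tau):
        key = (s, a)
        self.counts[key][s_next] += 1
        self.visits[key] += 1
        n = self.visits[key]
        # Incremental sample means of the reward and the discount factor.
        self.mean_reward[key] += (reward - self.mean_reward[key]) / n
        disc = math.exp(-self.discount_rate * tau)
        self.mean_discount[key] += (disc - self.mean_discount[key]) / n

    def prob(self, s, a, s_next):
        """Estimated transition probability (relative frequency)."""
        n = self.visits[(s, a)]
        return self.counts[(s, a)].get(s_next, 0) / n if n else 0.0

    def backup(self, s, a, value):
        """Model-based estimate of the discounted return of (s, a)."""
        key = (s, a)
        expected_next = sum(self.prob(s, a, j) * value[j] for j in self.counts[key])
        return self.mean_reward[key] + self.mean_discount[key] * expected_next


def softmax_choice(prefs, rng):
    """Sample an action from a softmax over action preferences."""
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]
    return rng.choices(range(len(prefs)), weights=weights)[0]


def train(simulate, n_states, n_actions, discount_rate=0.1, steps=5000,
          critic_step=0.1, actor_step=0.01, seed=0):
    rng = random.Random(seed)
    model = SMDPModel(discount_rate)
    value = [0.0] * n_states                              # critic
    prefs = [[0.0] * n_actions for _ in range(n_states)]  # actor
    s = 0
    for _ in range(steps):
        a = softmax_choice(prefs[s], rng)
        s_next, reward, tau = simulate(s, a, rng)
        model.update(s, a, s_next, reward, tau)           # build the model
        delta = model.backup(s, a, value) - value[s]      # model-based error
        value[s] += critic_step * delta                   # critic update
        prefs[s][a] += actor_step * delta                 # actor update
        s = s_next
    return model, value, prefs


def toy_simulate(s, a, rng):
    """Hypothetical two-state SMDP used only to exercise the sketch."""
    p_stay = 0.7 if a == 0 else 0.4
    s_next = s if rng.random() < p_stay else 1 - s
    reward = 1.0 if (s == 0 and a == 1) else 0.5
    tau = rng.expovariate(1.0 if a == 0 else 2.0)  # random transition time
    return s_next, reward, tau


model, value, prefs = train(toy_simulate, n_states=2, n_actions=2)
```

The count tables in `SMDPModel` grow with the number of state-action-next-state triples that are visited, which is one way to see the extra storage that the abstract attributes to model building.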

[1] Abhijit Gosavi, et al. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2003.

[2] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[3] Emmanuel Fernandez, et al. Control of a re-entrant line manufacturing model with a reinforcement learning approach, 2007, Sixth International Conference on Machine Learning and Applications (ICMLA 2007).

[4] Abhijit Gosavi, et al. Reinforcement learning for long-run average cost, 2004, Eur. J. Oper. Res.

[5] V. Borkar. Stochastic approximation with two time scales, 1997.

[6] Paul J. Werbos, et al. Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research, 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[7] Richard S. Sutton, et al. Reinforcement Learning, 1992, Handbook of Machine Learning.

[8] Shin Ishii, et al. A model-based reinforcement learning: a computational model and an fMRI study, 2003, ESANN.

[9] Jürgen Schmidhuber, et al. Model-based reinforcement learning for evolving soccer strategies, 2001.

[10] R. J. Williams, et al. On the use of backpropagation in associative reinforcement learning, 1988, IEEE 1988 International Conference on Neural Networks.

[11] Csaba Szepesvári, et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[12] Abhijit Gosavi, et al. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2003.

[13] S. Mahadevan, et al. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning, 1999.

[14] Tapas K. Das, et al. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking, 2002.

[15] Prasad Tadepalli, et al. Model-Based Average Reward Reinforcement Learning, 1998, Artif. Intell.

[16] Richard S. Sutton, et al. A Menu of Designs for Reinforcement Learning Over Time, 1995.

[17] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[18] Junichiro Yoshimoto, et al. Control of exploitation-exploration meta-parameter in reinforcement learning, 2002, Neural Networks.

[19] S. Shankar Sastry, et al. Autonomous Helicopter Flight via Reinforcement Learning, 2003, NIPS.

[20] Abhijit Gosavi, et al. Model-Building for Robust Reinforcement Learning, 2010.

[21] A. Barto, et al. Model-Based Adaptive Critic Designs, 2004.

[22] Pieter Abbeel, et al. Autonomous Autorotation of an RC Helicopter, 2008, ISER.

[23] Paul J. Werbos, et al. Consistency of HDP applied to a simple reinforcement learning problem, 1990, Neural Networks.

[24] R. Bellman. Dynamic programming, 1957, Science.

[25] Abhijit Gosavi. Reinforcement learning for model building and variance-penalized control, 2009, Proceedings of the 2009 Winter Simulation Conference (WSC).

[26] Abhijit Gosavi, et al. A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis, 2004, Machine Learning.

[27] Abhijit Gosavi, et al. Semi-Markov adaptive critic heuristics with application to airline revenue management, 2011.

[28] Vivek S. Borkar, et al. Actor-Critic-Type Learning Algorithms for Markov Decision Processes, 1999, SIAM J. Control. Optim.

[29] Shalabh Bhatnagar, et al. Actor-critic algorithms for hierarchical Markov decision processes, 2006, Autom.

[30] Abhijit Gosavi. Adaptive Critics for Airline Revenue Management, 2007.

[31] Steven I. Marcus, et al. Simulation-based Algorithms for Markov Decision Processes / Hyeong Soo Chang ... [et al.], 2013.

[32] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[33] Ashutosh Saxena, et al. High speed obstacle avoidance using monocular vision and reinforcement learning, 2005, ICML.

[34] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.

[35] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[36] Panos M. Pardalos, et al. Approximate dynamic programming: solving the curses of dimensionality, 2009, Optim. Methods Softw.

[37] Ronald A. Howard, et al. Dynamic Programming and Markov Processes, 1960.

[38] Mala Gosakan, et al. Human performance modeling for emergency management decision making, 2010.

[39] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[40] Andrew G. Barto, et al. Reinforcement learning, 1998.

[41] Warren B. Powell, et al. Approximate Dynamic Programming - Solving the Curses of Dimensionality, 2007.