ε-MDPs: Learning in Varying Environments

In this paper, ε-MDP models are introduced and convergence theorems are proven for them using the generalized MDP framework of Szepesvári and Littman. Using this model family, we show that Q-learning is capable of finding near-optimal policies in varying environments. The potential of this new family of MDP models is illustrated by a reinforcement learning algorithm called event-learning, which separates the optimization of decision making from the controller. We show that event-learning, augmented by a particular controller that gives rise to an ε-MDP, achieves near-optimal performance even if considerable and sudden changes occur in the environment. Illustrations are provided on the two-segment pendulum problem.
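As a rough illustration of the separation described in the abstract, the sketch below shows a tabular event-learning loop in which the agent learns event values E(s, s') over (state, desired successor state) pairs and delegates the task of actually reaching s' to a controller. This is only an illustrative sketch under assumptions not stated in the abstract: a finite, discretized state space, epsilon-greedy selection of the desired successor state, and hypothetical `env` and `controller` interfaces. It is not the authors' implementation of event-learning or of the particular controller they use.

```python
import numpy as np

class EventLearner:
    """Minimal tabular event-learning sketch (illustrative, not the paper's code)."""

    def __init__(self, n_states, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.E = np.zeros((n_states, n_states))  # event values E(s, s_desired)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.n_states = n_states

    def choose_desired_state(self, s):
        # Epsilon-greedy choice of the desired successor state.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_states)
        return int(np.argmax(self.E[s]))

    def update(self, s, s_desired, reward, s_next):
        # Q-learning-style backup on the event value; s_next may differ from
        # s_desired because the controller is imperfect.
        target = reward + self.gamma * np.max(self.E[s_next])
        self.E[s, s_desired] += self.alpha * (target - self.E[s, s_desired])


def run_episode(env, learner, controller, max_steps=200):
    # 'env' and 'controller' are hypothetical interfaces: the controller maps
    # (current state, desired state) to a low-level action, and can be re-tuned
    # when the environment changes without touching the event-value table.
    s = env.reset()
    for _ in range(max_steps):
        s_desired = learner.choose_desired_state(s)
        action = controller(s, s_desired)
        s_next, reward, done = env.step(action)
        learner.update(s, s_desired, reward, s_next)
        s = s_next
        if done:
            break
```

The intended benefit of this split is that only the controller has to adapt when the environment drifts or changes suddenly; the event values remain useful as long as the controller keeps the realized successor state close to the desired one, which is, roughly, the ε-MDP condition the abstract refers to.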

[1] Andrew G. Barto, et al. Discrete and Continuous Models, 1978.

[2] Geoffrey E. Hinton, et al. Feudal Reinforcement Learning, 1992, NIPS.

[3] Satinder P. Singh, et al. Scaling Reinforcement Learning Algorithms by Learning Variable Temporal Resolution Models, 1992, ML.

[4] Sridhar Mahadevan, et al. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning, 1991, Artif. Intell.

[5] Narendra Ahuja, et al. Gross motion planning—a survey, 1992, CSUR.

[6] Andrew G. Barto, et al. Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms, 1993, NIPS.

[7] Leslie Pack Kaelbling, et al. Hierarchical Learning in Stochastic Domains: Preliminary Results, 1993, ICML.

[8] Piero Mussio, et al. Toward a Practice of Autonomous Systems, 1994.

[9] Michael I. Jordan, et al. Massachusetts Institute of Technology Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[10] George H. John. When the Best Move Isn't Optimal: Q-learning with Exploration, 1994, AAAI.

[11] Michael L. Littman, et al. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[12] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[13] Matthias Heger, et al. Consideration of Risk in Reinforcement Learning, 1994, ICML.

[14] Katsuhisa Furuta, et al. Robust swing up control of double pendulum, 1995, Proceedings of the 1995 American Control Conference (ACC'95).

[15] Kenji Doya, et al. Temporal Difference Learning in Continuous Time and Space, 1995, NIPS.

[16] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[17] András Lörincz, et al. Self-Organizing Multi-Resolution Grid for Motion Planning and Control, 1996, Int. J. Neural Syst.

[18] Csaba Szepesvári, et al. Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms, 1996.

[19] Doina Precup, et al. Multi-time Models for Temporally Abstract Planning, 1997, NIPS.

[20] András Lörincz, et al. Neurocontroller using dynamic state feedback for compensatory control, 1997, Neural Networks.

[21] Maja J. Mataric, et al. Behavior-based Control: Examples from Navigation, Learning, and Group Behavior, 1997.

[22] Maja J. Mataric, et al. Behaviour-based control: examples from navigation, learning, and group behaviour, 1997, J. Exp. Theor. Artif. Intell.

[23] Csaba Szepesvári, et al. An integrated architecture for motion-control and path-planning, 1998.

[24] R. Sutton. Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales, 1998.

[25] Doina Precup, et al. Between MDPs and Semi-MDPs: Learning, Planning & Representing Knowledge at Multiple Temporal Scales, 1998.

[26] Csaba Szepesvári. Static and Dynamic Aspects of Optimal Sequential Decision Making, 1998.

[27] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[28] Csaba Szepesvári, et al. Approximate Inverse-Dynamics Based Robust Control Using Static and Dynamic Feedback, 1998.

[29] András Lörincz, et al. An integrated architecture for motion-control and path-planning, 1998, J. Field Robotics.

[30] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.

[31] Kenji Doya, et al. Reinforcement Learning in Continuous Time and Space, 2000, Neural Computation.

[32] Robert Givan, et al. Bounded-parameter Markov decision processes, 2000, Artif. Intell.

[33] S. H. G. ten Hagen. Continuous State Space Q-Learning for Control of Nonlinear Systems, 2001.

[34] Frank van Harmelen, et al. Proceedings of the 15th European Conference on Artificial Intelligence, 2002.

[35] András Lörincz, et al. Event-learning with a non-Markovian controller, 2002.

[36] András Lörincz, et al. Reinforcement Learning Integrated with a Non-Markovian Controller, 2002, ECAI.

[37] András Lörincz, et al. Event-learning and robust policy heuristics, 2003, Cognitive Systems Research.

[38] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[39] John N. Tsitsiklis, et al. Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.

[40] András Lörincz, et al. Module-Based Reinforcement Learning: Experiments with a Real Robot, 1998, Machine Learning.

[41] Sean R. Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.