Reinforcement Learning and Markov Decision Processes

Situated between supervised and unsupervised learning, the paradigm of reinforcement learning addresses learning in sequential decision-making problems with limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behavior: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, together with the definitions of value functions and policies. The main part of the text introduces foundational classes of algorithms for learning optimal behavior, based on various definitions of optimality with respect to the goal of learning sequential decisions. It then surveys efficient extensions of these foundational algorithms, which differ mainly in how they use feedback from the environment to speed up learning and in how they concentrate computation on relevant parts of the problem. In both model-based and model-free settings, these extensions have proven useful for scaling up to larger problems.
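
To make the two algorithm classes concrete, the following is a minimal Python sketch, not taken from the text itself: value iteration, a dynamic-programming method that repeatedly applies the Bellman optimality backup to a known model, alongside tabular Q-learning, a model-free reinforcement-learning method that estimates the same optimal action values from sampled transitions only. The two-state MDP, the discount factor, and all parameter settings are made-up illustrations.

    import random

    # Hypothetical toy MDP for illustration: P[s][a] is a list of
    # (probability, next_state, reward) triples.
    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
    }
    gamma = 0.9  # discount factor (assumed value)

    def value_iteration(P, gamma, theta=1e-8):
        """Dynamic programming: Bellman optimality backups over a known model."""
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V

    def q_learning(P, gamma, steps=5000, alpha=0.1, eps=0.1):
        """Model-free RL: learn Q from sampled transitions; the transition
        probabilities are used only to simulate the environment, never read
        by the learner itself."""
        Q = {s: {a: 0.0 for a in P[s]} for s in P}
        s = 0
        for _ in range(steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(list(Q[s]))
            else:
                a = max(Q[s], key=Q[s].get)
            # sample a transition from the environment
            probs, nexts, rewards = zip(*P[s][a])
            i = random.choices(range(len(probs)), weights=probs)[0]
            s2, r = nexts[i], rewards[i]
            # temporal-difference update toward the Bellman target
            Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
            s = s2
        return Q

    print(value_iteration(P, gamma))  # converges to the optimal state values
    print(q_learning(P, gamma))       # approximates the optimal action values

With enough samples, max over a of Q(s, a) from the second routine approaches the V(s) computed by the first, which is the sense in which the two classes solve the same problem with and without a model.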
