Feature-based aggregation and deep reinforcement learning: a survey and some new implementations

In this paper we discuss policy iteration methods for the approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller “aggregate” Markov decision problem whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
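To make the aggregation idea concrete, the following is a minimal sketch (in Python with NumPy) of hard, feature-based aggregation for a small finite-state discounted MDP: states sharing the same feature value are grouped into aggregate states, the resulting aggregate problem is solved by value iteration, and its solution is used as a nonlinear-in-the-features cost approximation in a policy improvement step. The random problem data, the feature map f, and the uniform disaggregation probabilities are illustrative placeholders under stated assumptions, not the paper's implementation.

```python
# Minimal sketch (hypothetical data, not the paper's implementation):
# hard aggregation for a finite-state discounted MDP, followed by one
# policy-improvement step using the aggregate cost function.
import numpy as np

np.random.seed(0)

# --- A small random discounted MDP (states i, controls u) ---
n_states, n_controls, alpha = 12, 3, 0.9
P = np.random.rand(n_controls, n_states, n_states)
P /= P.sum(axis=2, keepdims=True)            # p_ij(u): row-stochastic
g = np.random.rand(n_controls, n_states)     # expected stage cost g(i, u)

# --- Feature map: each state i is assigned to one aggregate state x = f(i) ---
# (hard aggregation: states sharing the same feature value are grouped)
f = np.array([i % 4 for i in range(n_states)])   # 4 aggregate states
n_agg = f.max() + 1

# Aggregation matrix Phi (membership of state i in aggregate state x) and
# disaggregation matrix D (x -> i, here uniform over the states grouped in x).
Phi = np.zeros((n_states, n_agg))
Phi[np.arange(n_states), f] = 1.0
D = Phi.T / Phi.sum(axis=0, keepdims=True).T

# --- Aggregate MDP: transition probabilities and expected costs ---
# P_hat(x,u,y) = sum_i d_xi sum_j p_ij(u) phi_jy,   g_hat(x,u) = sum_i d_xi g(i,u)
P_hat = np.einsum('xi,uij,jy->uxy', D, P, Phi)
g_hat = np.einsum('xi,ui->ux', D, g)

# --- Solve the aggregate problem by value iteration ---
r = np.zeros(n_agg)
for _ in range(1000):
    r_new = np.min(g_hat + alpha * P_hat @ r, axis=0)
    if np.max(np.abs(r_new - r)) < 1e-10:
        break
    r = r_new

# --- Cost approximation and one policy-improvement step on the original MDP ---
J_tilde = Phi @ r                                # J_tilde(i) = r(f(i))
mu = np.argmin(g + alpha * P @ J_tilde, axis=0)  # improved policy
print("aggregate costs r:", np.round(r, 3))
print("improved policy  :", mu)
```

In this sketch the cost approximation J_tilde is piecewise constant over the feature classes, i.e., a nonlinear function of the features obtained by solving the aggregate problem exactly; replacing the hand-coded feature map f with features produced by a trained neural network gives the flavor of the combined scheme discussed in the paper.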
