Learning Representation and Control in Markov Decision Processes: New Frontiers

Learning Representation and Control in Markov Decision Processes describes methods for automatically compressing Markov decision processes (MDPs) by learning a low-dimensional linear approximation defined by an orthogonal set of basis functions. A unique feature of the text is the use of Laplacian operators, whose matrix representations have non-positive off-diagonal elements and zero row sums. The generalized inverses of Laplacian operators, in particular the Drazin inverse, are shown to be useful in the exact and approximate solution of MDPs. The author goes on to describe a broad framework for solving MDPs, generically referred to as representation policy iteration (RPI), in which both the basis function representations for approximating value functions and the optimal policy within their linear span are learned simultaneously. Basis functions are constructed by diagonalizing a Laplacian operator or by dilating the reward function or an initial set of bases by powers of the operator. The idea of decomposing an operator by finding its invariant subspaces is shown to be an important principle in constructing low-dimensional representations of MDPs. Theoretical properties of these approaches are discussed, and the approaches are compared experimentally on a variety of discrete and continuous MDPs. Finally, challenges for further research are briefly outlined. Learning Representation and Control in Markov Decision Processes is a timely exposition of a topic of broad interest within machine learning and beyond.
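The two basis-construction routes mentioned above can be illustrated concretely. The sketch below (a minimal NumPy illustration, not code from the text; the 20-state chain MDP and the goal reward are hypothetical stand-ins) builds the combinatorial graph Laplacian of a chain of states, verifies its zero-row-sum property, takes its smoothest eigenvectors as an orthonormal basis in the spirit of proto-value functions, and, as the alternative route, dilates a reward vector by powers of the random-walk operator to form a Krylov-style basis.

```python
import numpy as np

# Hypothetical 20-state chain MDP: each state is connected to its neighbors.
n = 20
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Combinatorial graph Laplacian L = D - A: non-positive off-diagonal
# elements and zero row sums, as described in the text.
D = np.diag(A.sum(axis=1))
L = D - A

# Route 1: diagonalize the Laplacian and keep the k eigenvectors with
# the smallest eigenvalues -- a smooth orthonormal basis over states.
eigvals, eigvecs = np.linalg.eigh(L)
k = 5
Phi = eigvecs[:, :k]                      # n x k basis matrix

# Least-squares projection of a target value function onto span(Phi).
V = np.cos(np.linspace(0.0, np.pi, n))    # stand-in target values
w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
V_hat = Phi @ w

# Route 2: dilate the reward by powers of the transition operator
# (here a random walk on the chain) and orthonormalize the result.
P = A / A.sum(axis=1, keepdims=True)
r = np.zeros(n)
r[-1] = 1.0                               # stand-in reward at a goal state
K = np.column_stack([np.linalg.matrix_power(P, j) @ r for j in range(k)])
Q, _ = np.linalg.qr(K)                    # orthonormal reward-dilated basis
```

Either `Phi` or `Q` can then serve as the feature matrix inside a policy-iteration loop such as RPI, with the value function approximated in the basis's linear span.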
