Representation Discovery for MDPs Using Bisimulation Metrics

We provide a novel, flexible, iterative refinement algorithm to automatically construct an approximate state-space representation for Markov Decision Processes (MDPs). Our approach leverages bisimulation metrics, which have been used in prior work to generate features to represent the state space of MDPs. We address a drawback of this approach, which is the expensive computation of the bisimulation metrics. We propose an algorithm to generate an iteratively improving sequence of state space partitions. Partial metric computations guide the representation search and provide much lower space and computational complexity, while maintaining strong convergence properties. We provide theoretical results guaranteeing convergence as well as experimental illustrations of the accuracy and savings (in time and memory usage) of the new algorithm, compared to traditional bisimulation metric computation.

[1]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[2]  Radha Jagadeesan,et al.  Metrics for labelled Markov processes , 2004, Theor. Comput. Sci..

[3]  Doina Precup,et al.  Bisimulation Metrics are Optimal Value Functions , 2014, UAI.

[4]  Dimitri P. Bertsekas,et al.  Convergence Results for Some Temporal Difference Methods Based on Least Squares , 2009, IEEE Transactions on Automatic Control.

[5]  Lihong Li,et al.  Analyzing feature generation for value-function approximation , 2007, ICML '07.

[6]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[7]  Steven J. Bradtke,et al.  Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.

[8]  Robert Givan,et al.  Equivalence notions and model minimization in Markov decision processes , 2003, Artif. Intell..

[9]  Kim G. Larsen,et al.  Bisimulation through Probabilistic Testing , 1991, Inf. Comput..

[10]  Lihong Li,et al.  An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning , 2008, ICML '08.

[11]  C. Atkeson,et al.  Prioritized Sweeping : Reinforcement Learning withLess Data and Less Real , 1993 .

[12]  Doina Precup,et al.  Basis Function Discovery Using Spectral Clustering and Bisimulation Metrics , 2011, AAAI.

[13]  Radha Jagadeesan,et al.  Metrics for Labeled Markov Systems , 1999, CONCUR.

[14]  Doina Precup,et al.  On-the-Fly Algorithms for Bisimulation Metrics , 2012, 2012 Ninth International Conference on Quantitative Evaluation of Systems.

[15]  Balaraman Ravindran,et al.  Model Minimization in Hierarchical Reinforcement Learning , 2002, SARA.

[16]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[17]  D. Bertsekas,et al.  Adaptive aggregation methods for infinite horizon dynamic programming , 1989 .

[18]  Doina Precup,et al.  Metrics for Finite Markov Decision Processes , 2004, AAAI.

[19]  Andrew W. Moore,et al.  Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time , 1993, Machine Learning.