DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the latent space as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. We connect these results to prior work in the bisimulation literature, and explore the use of a variety of metrics. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations in a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.
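To make the two training losses concrete, the following is a minimal sketch, not the paper's implementation: it assumes a deterministic latent transition model and substitutes a squared-error surrogate for the Wasserstein transition loss used in the paper, and the names `phi`, `r_bar`, `p_bar`, and `deepmdp_losses` are hypothetical.

```python
# Minimal sketch of the two DeepMDP losses in PyTorch (illustrative only).
# The paper additionally constrains the Lipschitz constants of these networks
# and measures the transition loss with a Wasserstein metric; this sketch
# approximates that term with a simple L2 surrogate.
import torch
import torch.nn as nn

latent_dim, obs_dim, n_actions = 32, 84 * 84, 4

phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                    nn.Linear(256, latent_dim))           # encoder: s -> z
r_bar = nn.Linear(latent_dim + n_actions, 1)              # latent reward model
p_bar = nn.Linear(latent_dim + n_actions, latent_dim)     # latent transition model

def deepmdp_losses(s, a_onehot, r, s_next):
    """Compute reward-prediction and latent-transition losses for one batch."""
    z, z_next = phi(s), phi(s_next)
    za = torch.cat([z, a_onehot], dim=-1)
    # (1) Predict the observed reward from the latent state-action pair.
    reward_loss = (r_bar(za).squeeze(-1) - r).pow(2).mean()
    # (2) Deterministic-transition surrogate: match the predicted next latent
    # to the encoding of the observed next state. Stopping gradients through
    # the target encoding is one common design choice, not mandated by the paper.
    transition_loss = (p_bar(za) - z_next.detach()).pow(2).mean()
    return reward_loss, transition_loss
```

In an auxiliary-task setup like the Atari experiments described above, the sum of these two losses would be minimized alongside the usual model-free RL objective, sharing the encoder `phi` with the value network.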
