DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the latent space as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. We connect these results to prior work in the bisimulation literature, and explore the use of a variety of metrics. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations in a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.
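To make the two training losses concrete, the following is a minimal sketch, not the paper's implementation: it assumes a deterministic latent transition model and substitutes a squared-error surrogate for the Wasserstein transition loss used in the paper, and the names `phi`, `r_bar`, `p_bar`, and `deepmdp_losses` are hypothetical.

```python
# Minimal sketch of the two DeepMDP losses in PyTorch (illustrative only).
# The paper additionally constrains the Lipschitz constants of these networks
# and measures the transition loss with a Wasserstein metric; this sketch
# approximates that term with a simple L2 surrogate.
import torch
import torch.nn as nn

latent_dim, obs_dim, n_actions = 32, 84 * 84, 4

phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                    nn.Linear(256, latent_dim))           # encoder: s -> z
r_bar = nn.Linear(latent_dim + n_actions, 1)              # latent reward model
p_bar = nn.Linear(latent_dim + n_actions, latent_dim)     # latent transition model

def deepmdp_losses(s, a_onehot, r, s_next):
    """Compute reward-prediction and latent-transition losses for one batch."""
    z, z_next = phi(s), phi(s_next)
    za = torch.cat([z, a_onehot], dim=-1)
    # (1) Predict the observed reward from the latent state-action pair.
    reward_loss = (r_bar(za).squeeze(-1) - r).pow(2).mean()
    # (2) Deterministic-transition surrogate: match the predicted next latent
    # to the encoding of the observed next state. Stopping gradients through
    # the target encoding is one common design choice, not mandated by the paper.
    transition_loss = (p_bar(za) - z_next.detach()).pow(2).mean()
    return reward_loss, transition_loss
```

In an auxiliary-task setup like the Atari experiments described above, the sum of these two losses would be minimized alongside the usual model-free RL objective, sharing the encoder `phi` with the value network.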
