VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control

Recent breakthroughs in Go and other strategic games have demonstrated the great potential of reinforcement learning for intelligent decision-making in uncertain environments, but several bottlenecks arise when this paradigm is generalized to complex real-world tasks. Among them, the low data efficiency of model-free reinforcement learning algorithms is of particular concern. In contrast, model-based reinforcement learning algorithms can capture the underlying dynamics of the environment and rarely suffer from this data-efficiency problem. To address it, this paper proposes a model-based reinforcement learning algorithm with an embedded attention mechanism, as an extension of World Models. We learn the environment model with a Mixture Density Network Recurrent Network (MDN-RNN) for agents to interact with, and combine a variational auto-encoder (VAE) with attention in the state-value estimates used during policy learning. In this way, the agent can learn an optimal policy with fewer interactions with the real environment, and experiments demonstrate the effectiveness of our model on a control problem.
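The pipeline described above (VAE encoder producing a latent code, an MDN-RNN world model over latents, and attention over the combined features for state-value estimation) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch with randomly initialized weights: all dimensions, function names, and the single-query attention form are assumptions for exposition, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the paper does not specify these.
OBS_DIM, LATENT_DIM, HIDDEN_DIM, ATTN_DIM, N_MIX = 16, 4, 8, 6, 3

def vae_encode(obs, w_mu, w_logvar):
    """VAE encoder head: sample a latent z via the reparameterization trick."""
    mu, logvar = obs @ w_mu, obs @ w_logvar
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(LATENT_DIM)

def mdn_rnn_step(z, h, w_h, w_pi, w_mu):
    """One MDN-RNN step: update memory h, emit a Gaussian mixture over the next z."""
    h_next = np.tanh(np.concatenate([z, h]) @ w_h)
    logits = h_next @ w_pi
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                    # mixture weights
    mus = (h_next @ w_mu).reshape(N_MIX, LATENT_DIM)  # component means
    return h_next, pi, mus

def attended_value(z, h, p_z, p_h, w_out):
    """Soft attention over the latent code z and the memory h, then a scalar value."""
    tokens = np.stack([z @ p_z, h @ p_h])             # (2, ATTN_DIM) feature "tokens"
    query = tokens.mean(axis=0)
    scores = tokens @ query / np.sqrt(ATTN_DIM)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    context = attn @ tokens                           # attention-weighted summary
    return float(context @ w_out)

# Randomly initialized weights stand in for trained parameters.
w_mu = rng.standard_normal((OBS_DIM, LATENT_DIM)) * 0.1
w_logvar = rng.standard_normal((OBS_DIM, LATENT_DIM)) * 0.1
w_h = rng.standard_normal((LATENT_DIM + HIDDEN_DIM, HIDDEN_DIM)) * 0.1
w_pi = rng.standard_normal((HIDDEN_DIM, N_MIX)) * 0.1
w_mdn_mu = rng.standard_normal((HIDDEN_DIM, N_MIX * LATENT_DIM)) * 0.1
p_z = rng.standard_normal((LATENT_DIM, ATTN_DIM)) * 0.1
p_h = rng.standard_normal((HIDDEN_DIM, ATTN_DIM)) * 0.1
w_out = rng.standard_normal(ATTN_DIM) * 0.1

obs = rng.standard_normal(OBS_DIM)
z = vae_encode(obs, w_mu, w_logvar)
h, pi, mus = mdn_rnn_step(z, np.zeros(HIDDEN_DIM), w_h, w_pi, w_mdn_mu)
v = attended_value(z, h, p_z, p_h, w_out)
print(pi.sum(), mus.shape, v)
```

In a full training loop, the MDN-RNN mixture would be sampled to roll out imagined trajectories, so that the policy and value head are trained largely inside the learned model rather than against the real environment.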
