Action Selection for Composable Modular Deep Reinforcement Learning

In modular reinforcement learning (MRL), a complex decision-making problem is decomposed into multiple simpler subproblems, each solved by a separate module. These subproblems often have conflicting goals and incomparable reward scales. A composable decision-making architecture requires that even modules authored separately, with possibly misaligned reward scales, can be combined coherently; an arbitrator should weigh the different modules' action preferences to learn effective global action selection. We present GRACIAS, a novel framework that assigns fine-grained importance to the different modules based on their relevance in a given state, and that enables composable decision making on top of modern deep RL methods such as deep deterministic policy gradient (DDPG) and deep Q-learning. We provide insights into the convergence properties of GRACIAS and show that previous MRL algorithms reduce to special cases of our framework. We demonstrate experimentally on several standard MRL domains that our approach significantly outperforms previous MRL methods and is highly robust to incomparable reward scales. Our framework also extends MRL to complex Atari games such as Qbert, where it achieves a better learning curve than conventional RL algorithms.
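The arbitrator can be viewed as a state-conditioned weighting over the modules' action preferences. Below is a minimal sketch of one plausible realization of this idea in PyTorch; the names (`Arbitrator`, `select_action`), the network shape, and the softmax-weighted combination of per-module Q-values are our assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn as nn


class Arbitrator(nn.Module):
    """State-conditioned soft attention over modules (hypothetical sketch)."""

    def __init__(self, state_dim: int, n_modules: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_modules),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax keeps the module weights positive and normalized, so the
        # combination stays meaningful even when the modules' own reward
        # scales are not comparable.
        return torch.softmax(self.net(state), dim=-1)


def select_action(state: torch.Tensor, module_q: torch.Tensor,
                  arbitrator: Arbitrator) -> int:
    """Greedy global action from importance-weighted per-module Q-values.

    module_q has shape (n_modules, n_actions): row i holds module i's
    action preferences (e.g. its Q-values) in the current state.
    """
    weights = arbitrator(state)                               # (n_modules,)
    combined = (weights.unsqueeze(-1) * module_q).sum(dim=0)  # (n_actions,)
    return int(combined.argmax())


# Toy usage: three separately authored modules scoring four discrete actions.
state = torch.randn(8)
arb = Arbitrator(state_dim=8, n_modules=3)
module_q = torch.randn(3, 4)
print(select_action(state, module_q, arb))
```

Because the weights are learned per state, a fixed-priority or equal-weight arbitration scheme corresponds to a degenerate (state-independent) weighting, which is consistent with the claim that previous MRL algorithms arise as special cases.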
