Growing Action Spaces

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

[1]  Guy Lever,et al.  Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward , 2018, AAMAS.

[2]  Martha White,et al.  Reinforcement Learning with Function-Valued Action Spaces for Partial Differential Equation Control , 2018, ICML.

[3]  Andriy Mnih,et al.  Q-Learning in enormous action spaces via amortized approximate maximization , 2020, ArXiv.

[4]  Alex Graves,et al.  Automated Curriculum Learning for Neural Networks , 2017, ICML.

[5]  Marco Colombetti,et al.  Robot shaping: developing situated agents through learning , 1992 .

[6]  Trevor Darrell,et al.  Modular Architecture for StarCraft II with Deep Reinforcement Learning , 2018, AIIDE.

[7]  Christoph H. Lampert,et al.  Curriculum learning of multiple tasks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Shie Mannor,et al.  The Natural Language of Actions , 2019, ICML.

[9]  Andrew G. Barto,et al.  Autonomous shaping: knowledge transfer in reinforcement learning , 2006, ICML.

[10]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[11]  Qiang Yang,et al.  An Overview of Multi-task Learning , 2018 .

[12]  Andrew W. Moore,et al.  Variable Resolution Discretization in Optimal Control , 2002, Machine Learning.

[13]  Richard S. Sutton,et al.  Training and Tracking in Robotics , 1985, IJCAI.

[14]  Tom Schaul,et al.  Rainbow: Combining Improvements in Deep Reinforcement Learning , 2017, AAAI.

[15]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[16]  Tom Schaul,et al.  StarCraft II: A New Challenge for Reinforcement Learning , 2017, ArXiv.

[17]  Shane Legg,et al.  IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures , 2018, ICML.

[18]  Andrew W. Moore,et al.  The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces , 1993, Machine Learning.

[19]  Demis Hassabis,et al.  Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , 2017, ArXiv.

[20]  Wojciech Zaremba,et al.  Learning to Execute , 2014, ArXiv.

[21]  Tom Schaul,et al.  Unifying Count-Based Exploration and Intrinsic Motivation , 2016, NIPS.

[22]  Tom Schaul,et al.  Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.

[23]  Hado van Hasselt,et al.  Double Q-learning , 2010, NIPS.

[24]  Risto Miikkulainen,et al.  Competitive Coevolution through Evolutionary Complexification , 2011, J. Artif. Intell. Res..

[25]  Yee Whye Teh,et al.  Mix&Match - Agent Curricula for Reinforcement Learning , 2018, ICML.

[26]  Satinder Singh Transfer of learning by composing solutions of elemental sequential tasks , 2004, Machine Learning.

[27]  Arash Tavakoli,et al.  Action Branching Architectures for Deep Reinforcement Learning , 2017, AAAI.

[28]  Minoru Asada,et al.  Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning , 2005, Machine Learning.

[29]  Amos J. Storkey,et al.  Exploration by Random Network Distillation , 2018, ICLR.

[30]  Florian Richoux,et al.  TorchCraft: a Library for Machine Learning Research on Real-Time Strategy Games , 2016, ArXiv.

[31]  Richard Evans,et al.  Deep Reinforcement Learning in Large Discrete Action Spaces , 2015, 1512.07679.

[32]  Shimon Whiteson,et al.  The StarCraft Multi-Agent Challenge , 2019, AAMAS.

[33]  Abhinav Gupta,et al.  CASSL: Curriculum Accelerated Self-Supervised Learning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[34]  Shimon Whiteson,et al.  QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning , 2018, ICML.

[35]  Minoru Asada,et al.  Purposive behavior acquisition for a real robot by vision-based reinforcement learning , 1995, Machine Learning.

[36]  Jürgen Schmidhuber,et al.  A possibility for implementing curiosity and boredom in model-building neural controllers , 1991 .

[37]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[38]  Nataliya Sokolovska,et al.  Continuous Upper Confidence Trees , 2011, LION.

[39]  Sebastian Ruder,et al.  An Overview of Multi-Task Learning in Deep Neural Networks , 2017, ArXiv.

[40]  Peter Stone,et al.  Transfer Learning via Inter-Task Mappings for Temporal Difference Learning , 2007, J. Mach. Learn. Res..

[41]  Gerald Tesauro,et al.  Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..

[42]  Max Jaderberg,et al.  Population Based Training of Neural Networks , 2017, ArXiv.

[43]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[44]  Andrew W. Moore,et al.  The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces , 2004, Machine Learning.

[45]  Ilya Kostrikov,et al.  Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play , 2017, ICLR.

[46]  Shimon Whiteson,et al.  Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning , 2020, J. Mach. Learn. Res..

[47]  H. Jaap van den Herik,et al.  Progressive Strategies for Monte-Carlo Tree Search , 2008 .

[48]  Pieter Abbeel,et al.  Reverse Curriculum Generation for Reinforcement Learning , 2017, CoRL.

[49]  S. Whiteson,et al.  Adaptive Tile Coding for Value Function Approximation , 2007 .

[50]  Pieter Abbeel,et al.  Value Iteration Networks , 2016, NIPS.

[51]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[52]  Nicolas Usunier,et al.  Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks , 2016, ArXiv.

[53]  J. Elman Learning and development in neural networks: the importance of starting small , 1993, Cognition.