Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

We consider scenarios from the real-time strategy game StarCraft as new benchmarks for reinforcement learning algorithms. We propose micromanagement tasks, which present the problem of the short-term, low-level control of army members during a battle. From a reinforcement learning point of view, these scenarios are challenging because the state-action space is very large, and because there is no obvious feature representation for the state-action evaluation function. We describe our approach to tackle the micromanagement scenarios with deep neural network controllers from raw state features given by the game engine. In addition, we present a heuristic reinforcement learning algorithm which combines direct exploration in the policy space and backpropagation. This algorithm allows for the collection of traces for learning using deterministic policies, which appears much more efficient than, for example, ε-greedy exploration. Experiments show that with this algorithm, we successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle.
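The core idea of combining deterministic rollouts with direct exploration in policy space can be illustrated with a minimal sketch. This is not the paper's exact algorithm: `episode_return` is a hypothetical stand-in for a full StarCraft episode, and the gradient estimate uses a two-point (antithetic) perturbation variant of simultaneous perturbation stochastic approximation for stability.

```python
import random

def episode_return(params):
    """Toy stand-in for one deterministic episode: the (hypothetical)
    return peaks when every policy parameter equals 2.0."""
    return -sum((p - 2.0) ** 2 for p in params)

def perturbed(params, u, delta):
    """Policy parameters shifted along direction u by step delta."""
    return [p + delta * ui for p, ui in zip(params, u)]

def zero_order_step(params, lr=0.02, delta=0.1):
    """One update of episodic policy-space exploration: sample a random
    direction u, run two full episodes with the deterministically
    perturbed policies theta + delta*u and theta - delta*u, then ascend
    the finite-difference estimate of the return gradient along u."""
    u = [random.gauss(0.0, 1.0) for _ in params]
    r_plus = episode_return(perturbed(params, u, delta))
    r_minus = episode_return(perturbed(params, u, -delta))
    grad_scale = (r_plus - r_minus) / (2.0 * delta)
    return [p + lr * grad_scale * ui for p, ui in zip(params, u)]

random.seed(0)
theta = [0.0]          # start far from the optimum at 2.0
for _ in range(300):
    theta = zero_order_step(theta)
```

Because each rollout uses a fixed perturbed parameter vector, the policy acts deterministically for the whole episode, in contrast to per-step action noise such as ε-greedy.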
