Cooperation and communication in multiagent deep reinforcement learning

Acknowledgments

Many thanks to my advisor Peter Stone for many years of support and encouragement throughout the PhD process. In particular, I remember a dark time when I was directionless and depressed about the prospect of finding fruitful research directions or ever graduating. You told me that all PhDs go through a period where they are lost in the woods and encouraged me to continue onward. This thesis is a testament to the fact that, given sufficient persistence, encouragement, and caffeine, there is a way out of the woods. Thanks to my lab-mates and colleagues, who have shaped my research and added flavor to my graduate experience, for the time and feedback they so generously gave. Thanks to Kay Nettle and Amy Bush for tirelessly supporting and debugging the many issues that arise from setting up and maintaining a GPU cluster. The research in this thesis would have been impossible (or at least computationally intractable) without your help. Special thanks to my loving family and my fiancée Man Liang for their unconditional support throughout the ups and downs of the PhD process.

Abstract

Reinforcement learning is the area of machine learning concerned with learning which actions to execute in an unknown environment in order to maximize cumulative reward. As agents begin to perform tasks of genuine interest to humans, they will be faced with environments too complex for humans to predetermine the correct actions using hand-designed solutions. Instead, capable learning agents will be necessary to tackle complex real-world domains. However, traditional reinforcement learning algorithms have difficulty with domains featuring 1) high-dimensional continuous state spaces, for example pixels from a camera image, 2) high-dimensional parameterized-continuous action spaces, 3) partial observability, and 4) multiple independent learning agents. We hypothesize that deep neural networks hold the key to scaling reinforcement learning towards complex tasks. This thesis seeks to answer the following two-part question: 1) How can the power of deep neural networks be leveraged to extend reinforcement learning to complex environments featuring partial observability, high-dimensional parameterized-continuous state and action spaces, and sparse rewards? 2) How can multiple deep reinforcement learning agents learn to cooperate in a multiagent setting? To address the first part of this question, this thesis explores the idea of using recurrent neural networks to combat the partial observability experienced by agents in the domain of Atari 2600 video games. Next, we design a deep reinforcement learning agent capable of discovering effective policies for …
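To make the recurrent idea concrete, below is a minimal, hypothetical sketch of a DRQN-style recurrent Q-network in which an LSTM integrates per-frame convolutional features over time, so the hidden state can carry information the current frame does not show. It is written in PyTorch purely for illustration (the thesis does not prescribe this framework), and the 84x84 input resolution, layer sizes, and class name are assumptions, not the thesis's exact architecture.

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Illustrative DRQN-style network: convolutional features from each
    frame are fed through an LSTM, and Q-values are predicted from the
    recurrent hidden state to cope with partial observability."""

    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        # Convolutional trunk over a single 84x84 grayscale frame (assumed sizes).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # LSTM integrates the flattened 64x7x7 features across time steps.
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=hidden_size,
                            batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84); hidden carries state between calls.
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        feats = feats.reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden  # Q-values per time step, new hidden state

At acting time, such a network would be unrolled one frame per step, passing the returned hidden state back in, so that a single observed frame need not encode the full game state.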
