Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning

We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed ‘sub-environments’, and characterizes the algorithm's performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulation results show significant improvement over a standard Monte Carlo tree search in an environment with large state and action spaces.
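
Since the sequential multi-agent guarantees rest on submodularity of the team objective, the following minimal sketch illustrates the underlying greedy assignment pattern: agents are committed one at a time to the sub-environment with the largest marginal gain, which is the setting where classic (1 - 1/e)-style bounds for monotone submodular maximization apply. This is an illustrative assumption, not the paper's Dynamic Domain Reduction for Multi-Agent Planning implementation; the names `greedy_sequential_assignment`, `reward`, `agents`, and `candidates` are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's algorithm): greedy sequential
# assignment of agents to candidate sub-environments. If the team reward is
# monotone submodular in the set of (agent, sub-environment) assignments,
# this greedy rule inherits the standard submodular-maximization guarantees.

from typing import Callable, Hashable, List, Set, Tuple

Assignment = Tuple[Hashable, Hashable]  # (agent, sub-environment)


def greedy_sequential_assignment(
    agents: List[Hashable],
    candidates: List[Hashable],
    reward: Callable[[Set[Assignment]], float],
) -> Set[Assignment]:
    """Assign each agent, in sequence, to the unused candidate sub-environment
    that yields the largest marginal gain in the (assumed submodular) reward."""
    chosen: Set[Assignment] = set()
    used: Set[Hashable] = set()
    for agent in agents:
        best, best_gain = None, float("-inf")
        for cand in candidates:
            if cand in used:
                continue
            # Marginal gain of adding this assignment to the current set.
            gain = reward(chosen | {(agent, cand)}) - reward(chosen)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is not None:
            chosen.add((agent, best))
            used.add(best)
    return chosen
```

In the paper's setting, the per-assignment value would itself come from a search over the chosen sub-environment (e.g., a Monte Carlo tree search restricted to its abstracted state and action space); here the `reward` callable simply stands in for that evaluation.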
