Research on the Multiagent Joint Proximal Policy Optimization Algorithm for Cooperative Fixed-Wing UAV Obstacle Avoidance

Collaboration among multiple unmanned aerial vehicles (UAVs) has great application potential. To increase the intelligence and environmental adaptability of multi-UAV control, we study the application of deep reinforcement learning to multi-UAV cooperative control. To address the non-stationarity of the environment that arises in multiagent reinforcement learning as each learning agent's policy changes during training, this paper presents an improved multiagent reinforcement learning algorithm, the multiagent joint proximal policy optimization (MAJPPO) algorithm, which follows the centralized-learning, decentralized-execution paradigm. MAJPPO uses moving-window averaging to provide each agent with a centralized state-value function, enabling closer collaboration among the agents and increasing the cumulative reward obtained by the multiagent system. To evaluate the algorithm, we use MAJPPO to perform a multi-UAV formation task that requires traversing environments containing multiple obstacles. To reduce control complexity, each UAV is modeled by a six-degree-of-freedom, twelve-state dynamics model equipped with an attitude control loop. Experimental results show that MAJPPO achieves better performance and greater environmental adaptability.
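
The abstract only names the moving-window averaging step, so the following is a minimal sketch of one plausible reading, not the paper's exact formulation: per-agent critic estimates are pooled into a joint value and smoothed over a trailing window, so that every agent trains its PPO advantage against the same centralized state-value signal. All function and parameter names here (centralized_values, agent_values, window) are hypothetical.

    import numpy as np

    def centralized_values(agent_values: np.ndarray, window: int = 5) -> np.ndarray:
        # agent_values: shape (T, N), value estimates of N agents over T steps.
        # Pool the per-agent estimates into one joint value per timestep...
        joint = agent_values.mean(axis=1)
        # ...then smooth it with a trailing moving-window average.
        smoothed = np.empty_like(joint)
        for t in range(joint.shape[0]):
            lo = max(0, t - window + 1)
            smoothed[t] = joint[lo:t + 1].mean()
        # Broadcast the shared, smoothed value back to every agent, so each
        # agent's advantage is computed against the same centralized baseline.
        return np.repeat(smoothed[:, None], agent_values.shape[1], axis=1)

    # Usage: value estimates from 3 agents over 6 timesteps; every column of
    # `shared` is identical, i.e. all agents see the same value signal.
    values = np.random.default_rng(0).normal(size=(6, 3))
    shared = centralized_values(values, window=3)

In a PPO-style update, this shared value would replace each agent's local critic output when computing generalized advantage estimates, which is what would let decentralized policies be trained against a centralized learning signal.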
