Deep Reinforcement Learning-based UAV Navigation and Control: A Soft Actor-Critic with Hindsight Experience Replay Approach

In this paper, we propose SACHER (soft actor-critic (SAC) with hindsight experience replay (HER)), a deep reinforcement learning (DRL) algorithm. SAC is an off-policy, model-free DRL algorithm built on the maximum entropy framework, and it outperforms earlier DRL algorithms in terms of exploration, robustness, and learning performance. In SAC, however, maximizing the entropy-augmented objective may degrade the optimality of the learning outcomes. HER is a sample-efficient replay method that enhances the performance of off-policy DRL algorithms by allowing them to learn from both failures and successes. We apply HER to SAC and propose SACHER to improve the learning performance of SAC. More precisely, because HER improves the sample efficiency of SAC, SACHER reaches the desired optimal outcomes faster and more accurately than SAC alone. We apply SACHER to the navigation and control problem of unmanned aerial vehicles (UAVs), where SACHER generates the optimal navigation path of the UAV in the presence of various obstacles. Specifically, we demonstrate the effectiveness of SACHER in terms of tracking error and cumulative reward in UAV operation by comparing it with the state-of-the-art DRL algorithms SAC and deep deterministic policy gradient (DDPG). Note that SACHER, as applied to UAV navigation and control, can be used with arbitrary UAV models.
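The core mechanism HER contributes, as described above, is relabeling failed episodes with goals that were actually achieved, so the sparse reward becomes informative. The following is a minimal sketch of HER's "final" relabeling strategy; the `Transition` container, the tuple-valued states/goals, and the sparse reward function are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, replace

@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    goal: tuple

def her_relabel(episode, reward_fn):
    """Hindsight relabeling ('final' strategy): replay each transition
    as if the goal had been the state actually reached at episode end,
    recomputing the sparse reward with respect to that hindsight goal."""
    achieved = episode[-1].next_state
    return [
        replace(t, goal=achieved, reward=reward_fn(t.next_state, achieved))
        for t in episode
    ]

def sparse_reward(state, goal):
    """Sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if state == goal else -1.0
```

In an off-policy setting such as SAC, both the original and the relabeled transitions would be stored in the replay buffer, so even an episode that never reached the intended goal yields at least one success signal for the achieved goal.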
