Hybrid Hierarchical Reinforcement Learning for online guidance and navigation with partial observability

Abstract

Autonomous guidance and navigation problems often involve high-dimensional spaces and multiple objectives, and consequently a large number of states and actions, a difficulty known as the ‘curse of dimensionality’. Furthermore, systems often have only partial observability of their environment rather than perfect perception. Recent research has addressed these problems with Hierarchical Reinforcement Learning (HRL), which typically applies the same or similar reinforcement learning methods within one application so that multiple objectives can be combined. However, no single learning method is best suited to every objective. To learn optimal decision-making as efficiently as possible, this paper proposes a hybrid Hierarchical Reinforcement Learning method consisting of several levels, where each level uses a different method to optimize the learning with its own type of information and objective. An algorithm based on the proposed method is provided and applied to an online guidance and navigation task. The navigation environments are complex, partially observable, and a priori unknown. Simulation results indicate that, compared to flat or non-hybrid methods, the proposed hybrid Hierarchical Reinforcement Learning method can accelerate learning and alleviate the ‘curse of dimensionality’ in complex decision-making tasks. In addition, mixing relative micro-states with absolute macro-states can reduce uncertainty and ambiguity at the higher levels, allows learned results to be transferred efficiently within and across tasks, and makes the method applicable to non-stationary environments. The proposed method yields a hierarchical optimal policy for autonomous guidance and navigation without a priori knowledge of the system or the environment.
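
To make the two-level idea concrete, below is a minimal sketch in Python of a hybrid hierarchical agent for a grid-world navigation task. It is an illustration under stated assumptions, not the paper's exact algorithm: the high level learns over absolute macro-states (e.g., region indices) with an SMDP-style Q-learning update and selects subgoals, while the low level learns over relative micro-observations (e.g., local sensor readings around the agent) with one-step Q-learning, so its policy can be reused in every region. All names (HybridHRLAgent, choose_subgoal, etc.) are hypothetical.

```python
import random
from collections import defaultdict

class HybridHRLAgent:
    """Two-level agent: the high level plans over absolute macro-states with
    SMDP-style Q-learning; the low level acts on relative micro-observations
    with one-step Q-learning, conditioned on the current subgoal."""

    def __init__(self, subgoals, low_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.subgoals = list(subgoals)        # macro actions: subgoals the high level can set
        self.low_actions = list(low_actions)  # primitive actions (e.g., N/E/S/W)
        self.q_high = defaultdict(float)      # (macro_state, subgoal) -> value
        self.q_low = defaultdict(float)       # ((micro_obs, subgoal), action) -> value

    def _eps_greedy(self, q, state, actions):
        # epsilon-greedy over whichever Q-table the calling level owns
        if random.random() < self.eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    def choose_subgoal(self, macro_state):
        return self._eps_greedy(self.q_high, macro_state, self.subgoals)

    def choose_action(self, micro_obs, subgoal):
        # The observation is relative to the agent, so this table is shared
        # across all regions, which is what makes the low level transferable.
        return self._eps_greedy(self.q_low, (micro_obs, subgoal), self.low_actions)

    def update_low(self, micro_obs, subgoal, action, reward, next_obs, done):
        s, s2 = (micro_obs, subgoal), (next_obs, subgoal)
        best_next = 0.0 if done else max(self.q_low[(s2, a)] for a in self.low_actions)
        target = reward + self.gamma * best_next
        self.q_low[(s, action)] += self.alpha * (target - self.q_low[(s, action)])

    def update_high(self, macro_state, subgoal, option_return, n_steps, next_macro, done):
        # SMDP-style update: the subgoal ran for n_steps primitive steps,
        # so the bootstrap term is discounted by gamma ** n_steps.
        best_next = 0.0 if done else max(self.q_high[(next_macro, g)] for g in self.subgoals)
        target = option_return + (self.gamma ** n_steps) * best_next
        self.q_high[(macro_state, subgoal)] += self.alpha * (
            target - self.q_high[(macro_state, subgoal)])
```

A training loop would alternate between the levels: call choose_subgoal on entering a region, run choose_action and update_low until the subgoal terminates, then call update_high with the accumulated discounted return. Swapping in a different learner at either level (e.g., a model-based or approximate dynamic programming method at the low level) is the ‘hybrid’ degree of freedom the abstract refers to.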
