Learning to soar: Resource-constrained exploration in reinforcement learning

This paper examines temporal-difference reinforcement learning with adaptive and directed exploration for resource-limited missions. The scenario considered is that of an unpowered aerial glider learning to perform energy-gaining flight trajectories in a thermal updraft. The presented algorithm, eGP-SARSA(λ), uses a Gaussian process regression model to estimate the value function within a reinforcement learning framework. The Gaussian process also provides a variance on these estimates, which is used to measure, in terms of information gain, how much future observations would contribute to the value function model. To avoid myopic exploration, we developed a resource-weighted objective function that combines an estimate of the future information gain, computed via an action rollout, with the estimated value function to generate directed explorative action sequences. A number of modifications and computational speed-ups to the algorithm are presented, along with a standard GP-SARSA(λ) implementation using ε-greedy exploration for comparing the respective learning performances. The results show that, under this objective function, the learning agent continues exploring for better state-action trajectories when platform energy is high and follows conservative energy-gaining trajectories when platform energy is low.
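To make the exploration objective concrete, the short Python sketch below (not the paper's implementation) illustrates the basic idea under stated assumptions: a Gaussian process value model supplies a predictive mean and variance for candidate state-action rollouts, the variance is converted into an information-gain term, and the two are combined with a weight that scales with remaining platform energy, so exploration dominates when energy is high and value-seeking dominates when energy is low. The kernel, the linear energy weighting, and all names (GPValueModel, information_gain, resource_weighted_score) are illustrative assumptions rather than details taken from the paper.

    import numpy as np

    # Minimal sketch of a resource-weighted exploration objective, assuming a
    # GP value model with predictive variance. Illustrative only.

    def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
        """Squared-exponential covariance between rows of A and rows of B."""
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return signal_var * np.exp(-0.5 * d2 / length_scale**2)

    class GPValueModel:
        """Exact GP regression over state-action features (mean and variance)."""
        def __init__(self, X, y, noise_var=0.1):
            self.X = X
            self.noise_var = noise_var
            K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
            self.L = np.linalg.cholesky(K)
            self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

        def predict(self, Xs):
            Ks = rbf_kernel(self.X, Xs)
            mean = Ks.T @ self.alpha
            v = np.linalg.solve(self.L, Ks)
            var = np.diag(rbf_kernel(Xs, Xs)) - np.sum(v**2, axis=0)
            return mean, np.maximum(var, 1e-12)

    def information_gain(var, noise_var):
        """Entropy reduction from observing a point with predictive variance var."""
        return 0.5 * np.log(1.0 + var / noise_var)

    def resource_weighted_score(mean, var, noise_var, energy, energy_max):
        """Trade off value (exploitation) against info gain, scaled by energy."""
        w = energy / energy_max  # explore more when platform energy is high
        return mean + w * information_gain(var, noise_var)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.uniform(-1, 1, size=(30, 2))                       # visited state-action features
        y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)    # observed returns
        gp = GPValueModel(X, y)

        candidates = rng.uniform(-1, 1, size=(5, 2))                # candidate action rollouts
        mean, var = gp.predict(candidates)
        scores = resource_weighted_score(mean, var, gp.noise_var,
                                         energy=80.0, energy_max=100.0)
        print("best candidate:", int(np.argmax(scores)))

In the paper, the rollout and weighting are integrated into the eGP-SARSA(λ) learning loop; this sketch only shows how a resource-weighted score of this form could rank candidate action sequences.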
