Covariance analysis as a measure of policy robustness

In this paper we propose covariance analysis as a metric for reinforcement learning to improve the robustness of a learned policy. The local optima found during exploration are analyzed in terms of the total cumulative reward and the local behavior of the system in the neighborhood of each optimum. The analysis is performed in the solution space to select a policy that remains robust in uncertain and noisy environments. We demonstrate the utility of the method using our previously developed system in which an autonomous underwater vehicle (AUV) has to recover from a thruster failure [1]. When a failure is detected, the recovery system is invoked; it uses simulations to learn a new controller that exploits the remaining functioning thrusters to achieve the AUV's goal, that is, to reach a target position. In this paper, we use covariance analysis to examine the performance of the top n policies output by the previous algorithm. We propose a scoring metric that combines the output of the covariance analysis, the time the AUV takes to reach the target position, and the distance between the target position and the AUV's final position. The top policies are simulated in a noisy environment and evaluated with the proposed scoring metric to analyze the effect of noise on their performance, and the policy that exhibits the greatest tolerance to noise is selected. We show experimental results in which covariance analysis successfully selects a more robust policy that had been ranked lower by the original algorithm.
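
The selection step can be pictured with the minimal sketch below. It is an illustration only, not the paper's exact procedure: the `simulate(policy, noise_std)` rollout helper, the weights `w_cov`, `w_time`, `w_dist`, and the specific covariance statistic (trace of the covariance of final positions over noisy rollouts) are all assumptions standing in for the method described in the abstract.

```python
import numpy as np

def score_policy(policy, target, simulate, n_rollouts=50, noise_std=0.1,
                 w_cov=1.0, w_time=1.0, w_dist=1.0):
    """Lower score = more robust. Combines the spread of final positions
    under noise (a proxy for the covariance analysis), mean time to reach
    the target, and mean final distance to the target."""
    finals, times = [], []
    for _ in range(n_rollouts):
        # Hypothetical helper: one noisy rollout returning the AUV's final
        # position (array of shape (dim,)) and the elapsed time.
        final_pos, elapsed = simulate(policy, noise_std)
        finals.append(final_pos)
        times.append(elapsed)
    finals = np.asarray(finals)                      # shape (n_rollouts, dim)
    # Total variance of final positions: a small spread indicates the
    # policy's behavior is stable under noise.
    spread = np.trace(np.cov(finals, rowvar=False))
    mean_dist = np.linalg.norm(finals - np.asarray(target), axis=1).mean()
    mean_time = float(np.mean(times))
    return w_cov * spread + w_time * mean_time + w_dist * mean_dist

def select_robust_policy(top_policies, target, simulate):
    """Evaluate the top n candidate policies in the noisy simulator and
    return the one with the best (lowest) robustness score."""
    scores = [score_policy(p, target, simulate) for p in top_policies]
    return top_policies[int(np.argmin(scores))]
```

In this sketch the relative weighting of the three terms is arbitrary; in practice it would be tuned so that a policy with a slightly worse nominal reward but a much tighter spread of outcomes under noise can outrank the nominally best policy, which is the behavior reported in the experiments.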

[1] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[2] Konstantinos Kyriakopoulos, et al. Persistent Autonomy: the Challenges of the PANDORA Project, 2012.

[3] Matteo Leonetti, et al. On-line learning to recover from thruster failures on Autonomous Underwater Vehicles, 2013, 2013 OCEANS - San Diego.

[4] Konstantinos Kyriakopoulos, et al. PANDORA - Persistent Autonomy Through Learning, Adaptation, Observation and Replanning, 2012.

[5] Marc Carreras, et al. Girona 500 AUV: From Survey to Intervention, 2012, IEEE/ASME Transactions on Mechatronics.

[6] Darwin G. Caldwell, et al. On-line identification of autonomous underwater vehicles through global derivative-free optimization, 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[7] Kostas J. Kyriakopoulos, et al. Towards semi-autonomous operation of under-actuated underwater vehicles: sensor fusion, on-line identification and visual servo control, 2011, Auton. Robots.

[8] Stefan Schaal, et al. Policy Gradient Methods for Robotics, 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[9] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.

[10] Nikolaus Hansen, et al. The CMA Evolution Strategy: A Comparing Review, 2006, Towards a New Evolutionary Computation.

[11] Jan Peters, et al. Learning motor primitives for robotics, 2009, 2009 IEEE International Conference on Robotics and Automation.

[12] Olivier Sigaud, et al. Path Integral Policy Improvement with Covariance Matrix Adaptation, 2012, ICML.

[13] Dagmar Sternad, et al. Neuromotor Noise, Error Tolerance and Velocity-Dependent Costs in Skilled Performance, 2011, PLoS Comput. Biol.

[14] Alex M. Andrew. Robot Learning, edited by Jonathan H. Connell and Sridhar Mahadevan, Kluwer, Boston, 1993/1997, xii+240 pp., ISBN 0-7923-9365-1, 1999, Robotica.

[15] Darwin G. Caldwell, et al. Challenges for the policy representation when applying reinforcement learning in robotics, 2012, The 2012 International Joint Conference on Neural Networks (IJCNN).

[16] Persistent Autonomy through Learning, Adaptation, Observation and Replanning, Deliverable 1.4: Diagnosing and Predicting Task Failure.

[17] Rainer Storn, et al. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997, J. Glob. Optim.

[18] Darwin G. Caldwell, et al. Online discovery of AUV control policies to overcome thruster failures, 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[19] Dirk P. Kroese, et al. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning, 2004.