Surprise and Curiosity for Big Data Robotics

This paper introduces a new perspective on curiosity and intrinsic motivation, viewed as the problem of generating behavior data for parallel off-policy learning. We provide 1) the first measure of surprise based on off-policy general value function learning progress, 2) the first investigation of reactive behavior control with parallel gradient temporal difference learning and function approximation, and 3) the first demonstration of using curiosity-driven control to react to a non-stationary learning task—all on a mobile robot. Our approach improves scalability over previous off-policy robot learning systems, which is essential for making progress on the ultimate big-data decision-making problem: life-long robot learning.

Off-policy, life-long robot learning is an immense big-data decision-making problem. In life-long learning the agent’s task is to learn from an effectively infinite stream of interaction. For example, a robot updating 100 times a second, running 8 hours a day, with a few dozen sensors can produce over 100 gigabytes of raw observation data every year of its life. Beyond the temporal scale of the problem, off-policy life-long learning enables additional scaling in the number of things that can be learned in parallel, as demonstrated by recent predictive learning systems (see Modayil et al 2012, White et al 2013). A special challenge in off-policy, life-long learning is to select actions in a way that provides effective training data for potentially thousands or millions of prediction learners with diverse needs; this is the subject of this study.

Surprise and curiosity play an important role in any learning system. These ideas have been explored in the context of option learning (Singh et al 2005, Simsek and Barto 2006, Schembri et al 2007), developmental robot exploration (Schmidhuber 1991, Oudeyer et al 2007), and exploration and exploitation in reinforcement learning (see Baldassarre and Mirolli 2013 for an overview). Informally, surprise is an unexpected prediction error. For example, a robot might be surprised by its current draw as it drives across sand for the first time, or an agent might be surprised if its reward function suddenly changed sign, producing large unexpected negative rewards. An agent should, however, be unsurprised if its prediction of future sensory events falls within the error induced by sensor noise. Equipped with a measure of surprise, an agent can react—change how it is behaving—to unexpected situations to encourage relearning. We call this reactive adaptation curious behavior.

In this paper we study how surprise and curiosity can be used to adjust a robot’s behavior in the face of a changing world. In particular, we focus on the situation where a robot has already learned two off-policy predictions about two distinct policies. The robot then experiences a physical change that significantly impacts the predictive accuracy of a single prediction. The robot observes its own inability to accurately predict future battery current draw when it executes a rotation command, exciting its internal surprise measure. The robot’s behavior responds by selecting actions to speed relearning of the incorrect prediction—spinning in place until the robot is no longer surprised—and then returning to normal operation. This paper provides the first empirical demonstration of surprise and curiosity based on off-policy learning progress on a mobile robot.
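To make the reactive control scheme concrete, the following is a minimal sketch of one way such a behavior can be organized: each general value function (GVF) prediction learner reports a scalar surprise signal, and whichever prediction is most surprised temporarily takes over action selection with its own target policy. The interface (gvf_learners, surprise(), target_policy(), normal_policy) and the fixed threshold are illustrative assumptions, not the exact implementation used in our experiment.

```python
def curious_behavior_step(observation, gvf_learners, normal_policy,
                          surprise_threshold=2.0):
    """Pick the next action.  Each element of gvf_learners is assumed to
    expose surprise() (a scalar) and target_policy(observation) (an action);
    normal_policy(observation) is the robot's default behavior."""
    # Find the prediction whose recent error is most unexpected.
    most_surprised = max(gvf_learners, key=lambda g: g.surprise())
    if most_surprised.surprise() > surprise_threshold:
        # React: follow the target policy of the surprised prediction
        # (e.g., rotate in place for the rotation/current-draw prediction)
        # so the experience needed for relearning is generated quickly.
        return most_surprised.target_policy(observation)
    # Nothing is surprising: return to normal operation.
    return normal_policy(observation)
```

A threshold rule is only one option; any scheme that maps the learners’ surprise signals to changes in the behavior policy fits the same loop.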
Our specific instantiation of surprise is based on the instantaneous temporal difference error, rather than novelty, salience, or predicted error (all explored in previous work). Our measure is unique because 1) it balances knowledge-based and competence-based learning and 2) it uses error generated by off-policy reinforcement learning algorithms on real robot data. Our experiment uses a commodity off-the-shelf iRobot Create and a simple camera, resulting in real-time adaptive control with visual features. We focus on the particular case of responding to a dramatic increase in surprise due to a change in the world, rather than initial learning. The approach described in this paper scales naturally to the massive temporal streams, high-dimensional features, and many independent off-policy learners common in life-long robot learning.
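As a rough illustration of how such a signal can be computed, the sketch below pairs a standard GTD(λ) off-policy learner with linear function approximation with a surprise measure that compares the instantaneous TD error against a running estimate of its recent mean and variance. The normalization constants, step sizes, and class interface are illustrative assumptions rather than the exact measure used in our experiments.

```python
import numpy as np

class GTDLambdaLearner:
    """One off-policy GTD(lambda) prediction learner with linear function
    approximation, plus a TD-error-based surprise signal.  The GTD(lambda)
    update is standard; the surprise normalization (running mean/variance
    of the TD error) is an illustrative assumption."""

    def __init__(self, n_features, alpha=0.01, beta=0.001, lam=0.9, gamma=0.98):
        self.theta = np.zeros(n_features)  # primary weights (the prediction)
        self.w = np.zeros(n_features)      # auxiliary weights used by GTD
        self.e = np.zeros(n_features)      # eligibility trace
        self.alpha, self.beta, self.lam, self.gamma = alpha, beta, lam, gamma
        self.delta_mean, self.delta_var = 0.0, 1.0  # running TD-error statistics
        self.last_delta = 0.0

    def update(self, phi, phi_next, cumulant, rho):
        """One update from a behavior-policy transition (phi, phi_next are
        feature vectors); rho is the importance-sampling ratio pi(a|s)/b(a|s)."""
        delta = cumulant + self.gamma * self.theta.dot(phi_next) - self.theta.dot(phi)
        self.e = rho * (phi + self.gamma * self.lam * self.e)
        self.theta += self.alpha * (
            delta * self.e
            - self.gamma * (1.0 - self.lam) * self.e.dot(self.w) * phi_next)
        self.w += self.beta * (delta * self.e - phi.dot(self.w) * phi)
        # Track the typical scale of the TD error so that noise-level errors
        # do not register as surprising.
        self.delta_mean += 0.01 * (delta - self.delta_mean)
        self.delta_var += 0.01 * ((delta - self.delta_mean) ** 2 - self.delta_var)
        self.last_delta = delta
        return delta

    def surprise(self):
        """How unexpected the most recent TD error is, relative to its
        recent history."""
        return abs(self.last_delta - self.delta_mean) / (np.sqrt(self.delta_var) + 1e-8)
```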

[1] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. 1991.

[2] Sebastian Thrun, et al. Lifelong robot learning. Robotics and Autonomous Systems, 1993.

[3] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. NIPS, 2004.

[4] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[5] Özgür Şimşek and Andrew G. Barto. An intrinsic reward mechanism for efficient exploration. ICML, 2006.

[6] Massimiliano Schembri, Marco Mirolli, and Gianluca Baldassarre. Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot. IEEE 6th International Conference on Development and Learning (ICDL), 2007.

[7] Pierre-Yves Oudeyer, et al. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 2007.

[8] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML, 2009.

[9] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. AAMAS, 2011.

[10] Richard S. Sutton, et al. Gradient temporal-difference learning algorithms. 2011.

[11] Richard S. Sutton, et al. Scaling life-long off-policy learning. IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), 2012.

[12] Gianluca Baldassarre and Marco Mirolli (eds.). Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 2013.

[13] Richard S. Sutton, et al. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 2011.