Confidence-based progress-driven self-generated goals for skill acquisition in developmental robots

A reinforcement learning agent that autonomously explores its environment can use a curiosity drive to learn skills continually, in the absence of any external rewards. We formulate curiosity-driven exploration, and eventual skill acquisition, as a selective sampling problem. Each environment setting provides the agent with a stream of instances. An instance is a sensory observation that, when queried, yields an outcome that the agent is trying to predict. After an instance is observed, a query condition, derived herein, determines whether its outcome is statistically known or unknown to the agent, based on the confidence interval of an online linear classifier. Upon encountering the first unknown instance, the agent “queries” the environment to observe the outcome, which is expected to improve its confidence in the corresponding predictor. If the environment is in a setting where all instances are known, the agent generates a plan of action to reach a new setting, where an unknown instance is likely to be encountered. The desired setting is a self-generated goal, and the plan of action, essentially a program to solve a problem, is a skill. The success of the plan depends on the quality of the agent's predictors, which are improved through the queries described above. For validation, the method is applied to both a simulated and a real Katana robot arm in its “blocks-world” environment. Results show that the proposed method generates sample-efficient, curious exploration behavior that exhibits developmental stages, continual learning, and skill acquisition in an intrinsically motivated, playful agent.
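To make the query condition concrete, the following is a minimal Python sketch of a confidence-based selective-sampling rule for a single outcome predictor. It assumes a regularized least-squares classifier whose uncertainty about an instance x is measured by x^T A^{-1} x, in the spirit of the selective-sampling literature this work builds on; the class name SelectiveSampler, the parameter kappa, and the exact threshold are illustrative assumptions, not the query condition derived in the paper.

import numpy as np

class SelectiveSampler:
    # Minimal sketch of a confidence-based query condition for one outcome
    # predictor. Assumptions (not taken from the paper): a regularized
    # least-squares classifier whose per-instance uncertainty is x^T A^{-1} x,
    # and a threshold of the form kappa * sqrt(r * log(t)), as in the
    # selective-sampling literature. The paper derives its own condition;
    # this code only illustrates the general idea.

    def __init__(self, dim, kappa=1.0):
        self.A = np.eye(dim)    # regularized correlation matrix of queried inputs
        self.b = np.zeros(dim)  # accumulated label-weighted inputs
        self.kappa = kappa      # confidence scaling (hypothetical parameter)
        self.t = 0              # number of queried (labeled) instances so far

    def is_unknown(self, x):
        # The outcome of instance x counts as statistically unknown when the
        # classifier's margin falls inside its confidence interval.
        A_inv = np.linalg.inv(self.A)
        w = A_inv @ self.b                   # current weight vector
        margin = w @ x                       # signed prediction for x
        r = x @ A_inv @ x                    # uncertainty along direction x
        theta = self.kappa * np.sqrt(r * np.log(self.t + 2))
        return abs(margin) <= theta          # unknown -> query the environment

    def update(self, x, y):
        # Incorporate the queried outcome y in {-1, +1} for instance x.
        self.A += np.outer(x, x)
        self.b += y * x
        self.t += 1

In use, the agent would evaluate is_unknown on each instance of the current setting: the first instance that returns True triggers a query and a subsequent update, while a setting in which every instance returns False prompts planning toward a new setting where an unknown instance is likely.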
