Active body schema learning

Humanoid and industrial robots are becoming increasingly complex. Most of the algorithms used to control these systems require, or are improved by, a full kinematic model of the system. In this work, we are interested in autonomously learning the kinematic description, a.k.a. the body schema, of unknown systems. Even for a calibrated system, the ability to continuously tune its parameters allows the system to cope with failures, wear and tear, and modifications of the kinematic chain.

A natural inspiration for this type of learning is biology. Animals have knowledge of their own bodies that allows them to perform a large variety of motor tasks. Even after an injury, including the extreme case of limb removal, they can adapt their actions and continue to act [1]. Inspired by the way humans develop, previous work on body schema learning has focused on computing the kinematic model from random motions [2, 3]. Most of these works require large amounts of data to converge to the correct solution, and they point out that active strategies can help to reduce this complexity.

In this paper, we study an active approach to body schema learning. The main idea is to select the motions that are most informative with respect to the current knowledge about the body schema. To apply an active learning strategy, the robot is described using a physical parametric model of the body schema (e.g., the joint locations and orientations of a robotic arm). Learning the body schema is then a parameter estimation problem. We adopt a Bayesian perspective and compute the posterior of these parameters based on observations of the end-effector of the arm and the motion commands (arm configurations). The distribution over the parameters is sequentially updated using Recursive Least Squares (RLS) as new observations are gathered. Given this posterior, it is possible to formulate an optimal exploration strategy. Unfortunately, this results in an NP-complete problem.
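As a minimal sketch of the sequential update step, consider a linearized observation model y = phi(q)^T theta relating an arm configuration q to one coordinate of the measured end-effector position (the feature map phi, the forgetting factor, and all variable names here are illustrative assumptions, not the exact formulation of the paper):

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive-least-squares step on the body-schema parameters.

    theta : current parameter estimate, shape (d,)
    P     : current parameter covariance estimate, shape (d, d)
    phi   : regressor built from the commanded configuration, shape (d,)
    y     : scalar observation (e.g. one end-effector coordinate)
    lam   : forgetting factor; lam < 1 discounts old data, which helps
            track changes such as wear or modified kinematics
    """
    denom = lam + phi @ P @ phi
    k = P @ phi / denom                     # Kalman-style gain vector
    theta = theta + k * (y - phi @ theta)   # correct by prediction error
    P = (P - np.outer(k, phi @ P)) / lam    # shrink covariance
    return theta, P
```

Calling `rls_update` once per new observation keeps a running Gaussian belief (mean `theta`, covariance `P`) over the kinematic parameters, which is the posterior that the exploration strategy then exploits.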
To overcome this limitation, we use model predictive control, i.e., a fixed finite horizon, and perform a policy search over this horizon. This search in the space of policies is again a complex procedure, and we use an anytime adaptive online algorithm [4] to select the best policy at every time step. Standard techniques such as linear-quadratic-Gaussian (LQG) controllers are not directly applicable because the model is nonlinear and non-Gaussian and the cost function is not quadratic. Also, since the action and parameter spaces are high-dimensional and continuous, one cannot use methods based on discretization of the problem [5]. Our approach is based on a global optimization method using a response surface of the expected cost of the actions. This surface is modeled as a Gaussian Process (GP), which allows us to design exploration strategies in an intelligent way, i.e., based on Bayesian design of new queries. In addition, we introduce some improvements on the algorithm from [4] that reduce the cost of the standard GP predictions.

Intuitively, the process is as follows. To find a new exploration point, we start by randomly sampling a small set of policies (robot configurations) and computing their expected cost. Based on these values, we
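The GP response-surface idea can be sketched as follows. This is a hedged illustration rather than the algorithm of [4]: the RBF kernel, its hyperparameters, and the lower-confidence-bound acquisition rule are assumptions chosen for brevity, standing in for whichever kernel and Bayesian query-design criterion the full method uses.

```python
import numpy as np

def gp_posterior(X, y, Xq, length=0.3, noise=1e-6):
    """GP posterior mean/std at query points Xq, RBF kernel, unit prior variance."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = k(X, X) + noise * np.eye(len(X))    # jitter for numerical stability
    Ks = k(Xq, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks @ alpha                          # posterior mean of the cost surface
    v = np.linalg.solve(L, Ks.T)
    var = 1.0 - (v ** 2).sum(0)              # posterior variance
    return mu, np.sqrt(np.maximum(var, 0.0))

def next_query(X, y, candidates, kappa=2.0):
    """Pick the next policy to evaluate: low predicted cost, high uncertainty."""
    mu, sd = gp_posterior(X, y, candidates)
    return candidates[np.argmin(mu - kappa * sd)]
```

Here the evaluated policies `X` with expected costs `y` define the response surface; `next_query` then balances exploiting low predicted cost against exploring regions where the surface is still uncertain.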