APRIL: Active Preference-based Reinforcement Learning

This work tackles in-situ robotics: the goal is to learn a policy while the robot operates in the real world, with neither ground truth nor reward signal available. The proposed approach builds on preference-based policy learning: iteratively, the robot demonstrates a few policies, is informed of the expert's preferences among the demonstrated policies, constructs a utility function compatible with all expert preferences, uses it in a self-training phase, and demonstrates a new policy in the next iteration. Whereas in previous work the new policy was the one maximizing the current utility function, this paper uses active ranking to select the most informative policy (Viappiani and Boutilier 2010). The challenge is the following: a policy return estimate (the expert's approximate preference function) learned on the policy parameter space, referred to as the direct representation, fails to provide useful information, since arbitrarily small modifications of the direct policy representation can produce significantly different behaviors and thus elicit different assessments from the expert. A behavioral policy space, referred to as the indirect representation and built automatically from the sensorimotor data stream generated by the operating robot, is therefore devised and used to express the policy return estimate. Meanwhile, active ranking criteria are classically expressed with respect to the explicit domain representation, here the direct policy representation. A novelty of the paper is to show how active ranking can be achieved through black-box optimization on the indirect policy representation. Two experiments, in single-robot and two-robot settings, illustrate the approach.
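
To make the interaction loop concrete, here is a minimal Python sketch of an APRIL-style iteration under stated simplifications: the rollout function, the simulated expert, the linear utility fitted with perceptron-style ranking updates, and the novelty bonus used as the selection score are all illustrative stand-ins, not the paper's exact components (the actual active ranking criterion follows Viappiani and Boutilier 2010 and is not reproduced here). The sketch only shows how preferences collected on behavioral descriptors (the indirect representation) drive a black-box search over policy parameters (the direct representation).

# Minimal APRIL-style interaction loop (illustrative sketch; the rollout,
# simulated expert, ranking updates and selection score are assumptions,
# not the paper's exact components).
import numpy as np

rng = np.random.default_rng(0)

D_THETA, D_PHI = 6, 10            # direct (parametric) vs. indirect (behavioral) dimensions
W_TRUE = rng.normal(size=D_PHI)   # hidden expert utility over behaviors (simulation only)
PROJ = rng.normal(size=(D_PHI, D_THETA))

def rollout(theta):
    # Run the policy and summarize the sensorimotor stream as a behavioral
    # descriptor phi(theta); a fixed nonlinear projection stands in for the robot.
    return np.tanh(PROJ @ theta)

def expert_prefers(phi_a, phi_b):
    # Simulated expert: prefers the demonstration with higher hidden utility.
    return W_TRUE @ phi_a > W_TRUE @ phi_b

def fit_utility(preferences, epochs=200, lr=0.1):
    # Fit a linear utility w on behavioral descriptors so that each stored
    # preference (phi_win, phi_lose) satisfies w . (phi_win - phi_lose) > 0
    # (perceptron-style ranking updates).
    w = np.zeros(D_PHI)
    for _ in range(epochs):
        for phi_win, phi_lose in preferences:
            if w @ (phi_win - phi_lose) <= 0:
                w += lr * (phi_win - phi_lose)
    return w

def select_next_policy(w, archive, n_candidates=500):
    # Black-box search in the direct (parametric) space, scored in the indirect
    # (behavioral) space: estimated utility plus a novelty bonus w.r.t. already
    # demonstrated behaviors (a crude stand-in for an active ranking criterion).
    best_theta, best_score = None, -np.inf
    for theta in rng.normal(size=(n_candidates, D_THETA)):
        phi = rollout(theta)
        score = w @ phi + min(np.linalg.norm(phi - p) for p in archive)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta

theta = rng.normal(size=D_THETA)   # initial policy (direct representation)
archive = [rollout(theta)]         # behaviors demonstrated so far
preferences = []                   # pairwise expert preferences gathered so far
for it in range(10):
    w = fit_utility(preferences) if preferences else np.zeros(D_PHI)
    theta_new = select_next_policy(w, archive)
    phi_cur, phi_new = rollout(theta), rollout(theta_new)
    if expert_prefers(phi_new, phi_cur):        # expert compares the two demonstrations
        preferences.append((phi_new, phi_cur))
        theta = theta_new                       # adopt the preferred policy
    else:
        preferences.append((phi_cur, phi_new))
    archive.append(phi_new)
    print(f"iter {it}: hidden utility of current policy = {W_TRUE @ rollout(theta):.3f}")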

[1] Donald R. Jones, et al. Efficient Global Optimization of Expensive Black-Box Functions, 1998, J. Glob. Optim.

[2] A. J. Booker, et al. A rigorous framework for optimization of expensive functions by surrogates, 1998.

[3] Andrew W. Moore, et al. Rates of Convergence for Variable Resolution Schemes in Optimal Control, 2000, ICML.

[4] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[5] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.

[6] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[7] Robert E. Schapire, et al. A Game-Theoretic Approach to Apprenticeship Learning, 2007, NIPS.

[8] Aude Billard, et al. On Learning, Representing, and Generalizing a Task in a Humanoid Robot, 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[9] Nando de Freitas, et al. Active Preference Learning with Discrete Choice Data, 2007, NIPS.

[10] Pieter Abbeel, et al. Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, 2007, NIPS.

[11] Nello Cristianini, et al. Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.

[12] Scott Kuindersma, et al. Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories, 2010, NIPS.

[13] Craig Boutilier, et al. Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets, 2010, NIPS.

[14] Michèle Sebag, et al. Preference-Based Policy Learning, 2011, ECML/PKDD.

[15] Eyke Hüllermeier, et al. Preference-Based Policy Iteration: Leveraging Preference Learning for Reinforcement Learning, 2011, ECML/PKDD.

[16] Marc Schoenauer, et al. Preference-based Reinforcement Learning, 2011.