A Bayesian Approach for Policy Learning from Trajectory Preference Queries

We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state, and the expert indicates which trajectory is preferred. The agent's goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem, we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection.
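The query loop described above can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a one-dimensional linear policy u = -theta * x, a simple discrete-time system, a grid posterior over theta, a logistic preference likelihood based on trajectory distance to the target policy's run, a simulated expert that always gives the more likely answer, and expected information gain as the query selection criterion. The names `rollout`, `pref_prob`, `info_gain`, `update`, and `run` are hypothetical.

```python
import math
import random

def rollout(theta, x0, steps=5):
    """Short run of the toy policy u = -theta * x from state x0 (assumed dynamics)."""
    xs, x = [], x0
    for _ in range(steps):
        x = x + 0.5 * (-theta * x)
        xs.append(x)
    return xs

def dist(traj_a, traj_b):
    """Sum of absolute state differences between two equal-length runs."""
    return sum(abs(a - b) for a, b in zip(traj_a, traj_b))

def pref_prob(theta, x0, th_a, th_b, beta=5.0):
    """Assumed likelihood: P(expert prefers the run of th_a | target policy = theta)."""
    target = rollout(theta, x0)
    d_a = dist(rollout(th_a, x0), target)
    d_b = dist(rollout(th_b, x0), target)
    return 1.0 / (1.0 + math.exp(-beta * (d_b - d_a)))

def entropy(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def info_gain(weights, grid, query):
    """Expected information gain of a query (x0, th_a, th_b) under the posterior."""
    ps = [pref_prob(th, *query) for th in grid]
    mean_p = sum(w * p for w, p in zip(weights, ps))
    return entropy(mean_p) - sum(w * entropy(p) for w, p in zip(weights, ps))

def update(weights, grid, query, a_preferred):
    """Bayes update of the grid posterior after one expert answer."""
    new = []
    for w, th in zip(weights, grid):
        p = pref_prob(th, *query)
        new.append(w * (p if a_preferred else 1.0 - p))
    z = sum(new)
    return [w / z for w in new]

def run(theta_star=0.6, n_queries=15, seed=0):
    """Actively query a simulated expert whose latent target is theta_star."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(11)]
    weights = [1.0 / len(grid)] * len(grid)
    for _ in range(n_queries):
        # Random candidate queries: a start state and a pair of policies.
        cands = [(rng.uniform(-1, 1), rng.choice(grid), rng.choice(grid))
                 for _ in range(30)]
        query = max(cands, key=lambda q: info_gain(weights, grid, q))
        # Simulated expert: picks the run that is more likely preferred.
        answer = pref_prob(theta_star, *query) >= 0.5
        weights = update(weights, grid, query, answer)
    return grid, weights

if __name__ == "__main__":
    grid, weights = run()
    for th, w in zip(grid, weights):
        print(f"theta={th:.1f}  posterior={w:.3f}")
```

Replacing the `max(...)` line with `query = cands[0]` gives a random-selection baseline under the same toy assumptions, which is one simple way to compare active and random querying.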
