Boosted and reward-regularized classification for apprenticeship learning

This paper deals with learning from demonstrations, where an agent called the apprentice tries to learn a behavior from demonstrations given by another agent called the expert. To address this problem, we adopt the Markov Decision Process (MDP) framework, which is well suited to sequential decision-making problems. One way to tackle the problem is to reduce it to classification, but doing so ignores the MDP structure. Methods that do exploit the MDP structure either need to solve MDPs, which is a difficult task, and/or require a problem-dependent choice of features. The main contribution of this paper is to extend a large-margin classification approach by adding a regularization term that takes the MDP structure into account. The resulting algorithm, called Reward-regularized Classification for Apprenticeship Learning (RCAL), does not need to solve MDPs. Its major advantage is that it can be boosted, which avoids the choice of features required by parametric approaches. Experiments on a state-of-the-art benchmark (Highway) and on generic structured Garnet problems compare RCAL with algorithms from the literature.
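To make the idea concrete, the sketch below (Python with NumPy; all function and variable names are illustrative, not taken from the paper) shows one plausible form of such a reward-regularized large-margin criterion on a tabular score function: a hinge term pushing each demonstrated expert action above all other actions by a margin, plus a penalty on the reward implied by the Bellman equation, estimated from sampled transitions. The paper minimizes a criterion of this flavor with boosting over non-parametric function approximators; the exact regularizer and the constants used here are assumptions for illustration only.

```python
import numpy as np

def rcal_style_loss(Q, expert_states, expert_actions,
                    trans_states, trans_actions, trans_next_states,
                    gamma=0.99, lam=0.1, margin=1.0):
    """Illustrative reward-regularized large-margin criterion (tabular sketch)."""
    # Large-margin classification term: the expert action in each demonstrated
    # state should score higher than every other action by at least `margin`.
    scores = Q[expert_states] + margin                       # shape (N, A)
    scores[np.arange(len(expert_states)), expert_actions] -= margin
    hinge = scores.max(axis=1) - Q[expert_states, expert_actions]

    # MDP-structure regularizer (assumed form): penalize the reward implied by
    # the Bellman equation, R(s, a) = Q(s, a) - gamma * max_a' Q(s', a'),
    # estimated on sampled transitions (s, a, s').
    implied_reward = (Q[trans_states, trans_actions]
                      - gamma * Q[trans_next_states].max(axis=1))
    return hinge.mean() + lam * np.abs(implied_reward).mean()

# Toy usage on random data (shapes only; not a real MDP).
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 4))
loss = rcal_style_loss(Q,
                       expert_states=np.array([0, 1, 2]),
                       expert_actions=np.array([1, 3, 0]),
                       trans_states=np.array([0, 1]),
                       trans_actions=np.array([1, 2]),
                       trans_next_states=np.array([2, 3]))
```

In the boosted setting described in the abstract, the tabular array Q would be replaced by a learned score function (e.g., a sum of regression trees), and a criterion of this kind would be minimized by functional gradient descent rather than over a fixed feature representation.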
