Apprenticeship Learning via Frank-Wolfe

We consider applications of the Frank-Wolfe (FW) algorithm to Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) whose reward function is not specified explicitly; instead, an expert acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert's policy. We formulate this problem as finding the projection of the expert's feature expectations onto the feature expectations polytope -- the convex hull of the feature expectations of all deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective, and that solving it with the FW algorithm recovers the best-known AL algorithm, the projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter convergence bounds for AL. Specifically, we show that a variant of the FW method based on taking "away steps" achieves a linear rate of convergence when applied to AL, and we show experimentally that this variant outperforms the standard FW baseline. To the best of our knowledge, this is the first work to establish linear convergence rates for AL.
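
As a concrete illustration of this projection view, here is a minimal sketch of away-step Frank-Wolfe applied to the problem above. It is our own toy example, not code from the paper: each row of `vertices` stands in for the feature expectations of one deterministic policy, `target` stands in for the expert's feature expectations, and the linear-minimization step enumerates the vertices explicitly, whereas in AL it would be computed by solving the MDP with reward given by the negative gradient.

```python
import numpy as np


def away_step_fw(vertices, target, n_iters=1000, tol=1e-8):
    """Project `target` onto conv(rows of `vertices`) by minimizing
    f(x) = 0.5 * ||x - target||^2 with away-step Frank-Wolfe.
    Illustrative toy setup, not the paper's implementation."""
    n, _ = vertices.shape
    weights = np.zeros(n)        # convex-combination weights over the vertices
    weights[0] = 1.0
    x = vertices[0].astype(float).copy()

    for _ in range(n_iters):
        grad = x - target        # gradient of the quadratic objective

        # FW vertex: minimizer of the linearized objective over the polytope.
        s = int(np.argmin(vertices @ grad))
        d_fw = vertices[s] - x
        if -grad @ d_fw < tol:   # FW duality gap is small enough: stop
            break

        # Away vertex: the active vertex the linearization most wants to leave.
        active = np.flatnonzero(weights > 0)
        v = active[int(np.argmax(vertices[active] @ grad))]
        d_away = x - vertices[v]

        take_fw = (-grad @ d_fw) >= (-grad @ d_away)
        if take_fw:
            d, gamma_max = d_fw, 1.0
        else:                    # away step: shift weight off vertex v
            d, gamma_max = d_away, weights[v] / (1.0 - weights[v])

        # Exact line search for the quadratic, clipped to the feasible range.
        gamma = min(max(-(grad @ d) / (d @ d), 0.0), gamma_max)

        x = x + gamma * d
        if take_fw:
            weights *= 1.0 - gamma
            weights[s] += gamma
        else:
            weights *= 1.0 + gamma
            weights[v] -= gamma
    return x


# Toy usage: 50 hypothetical "policies" with 5-dimensional feature expectations.
rng = np.random.default_rng(0)
V = rng.standard_normal((50, 5))
mu_expert = rng.standard_normal(5)
proj = away_step_fw(V, mu_expert)
```

The away step is what lets the iterate shed weight from poorly aligned active vertices; without it, plain FW can zigzag near the boundary of the polytope and is in general limited to a sublinear O(1/t) rate on this objective (cf. the tight bound of Canon and Cullum [28]).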

[1] Yinyu Ye, et al. The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate, 2011, Math. Oper. Res.

[2] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[3] Michael H. Bowling, et al. Apprenticeship learning using linear programming, 2008, ICML.

[4] Elad Hazan, et al. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets, 2014, ICML.

[5] Haipeng Luo, et al. Variance-Reduced and Projection-Free Stochastic Optimization, 2016, ICML.

[6] J. Andrew Bagnell, et al. Maximum margin planning, 2006, ICML.

[7] Marc Teboulle, et al. A conditional gradient method with linear rate of convergence for solving convex linear systems, 2004, Math. Methods Oper. Res.

[8] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[9] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[10] Dan Garber, et al. A Linearly Convergent Variant of the Conditional Gradient Algorithm under Strong Convexity, with Applications to Online and Stochastic Optimization, 2016, SIAM J. Optim.

[11] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.

[12] Martin Jaggi, et al. An Affine Invariant Linear Convergence Analysis for Frank-Wolfe Algorithms, 2013, arXiv:1312.7864.

[13] Elad Hazan, et al. A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization, 2013, arXiv:1301.4666.

[14] Haim Kaplan, et al. Average reward reinforcement learning with unknown mixing times, 2019, arXiv.

[15] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[16] Shimrit Shtern, et al. Linearly convergent away-step conditional gradient for non-strongly convex functions, 2015, Math. Program.

[17] Boris Polyak, et al. Constrained minimization methods, 1966.

[18] Robert E. Schapire, et al. A Game-Theoretic Approach to Apprenticeship Learning, 2007, NIPS.

[19] Yishay Mansour, et al. Learning Rates for Q-learning, 2004, J. Mach. Learn. Res.

[20] Peter Bro Miltersen, et al. Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor, 2010, JACM.

[21] Elad Hazan, et al. Projection-free Online Learning, 2012, ICML.

[22] A. S. Nemirovsky, et al. Problem Complexity and Method Efficiency in Optimization, 1983.

[23] Javier Peña, et al. Polytope Conditioning and Linear Convergence of the Frank-Wolfe Algorithm, 2015, Math. Oper. Res.

[24] Marguerite Frank, et al. An algorithm for quadratic programming, 1956, Naval Research Logistics Quarterly.

[25] Haim Kaplan, et al. Unknown mixing times in apprenticeship and reinforcement learning, 2020, UAI.

[26] Patrice Marcotte, et al. Some comments on Wolfe's ‘away step’, 1986, Math. Program.

[27] Martin Jaggi, et al. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, 2013, ICML.

[28] M. Canon, et al. A Tight Upper Bound on the Rate of Convergence of Frank-Wolfe Algorithm, 1968.