Inverse Reinforcement Learning with PI²

We present an algorithm that recovers an unknown cost function from expert-demonstrated trajectories in continuous space. We assume that the cost function is a weighted linear combination of features, and we learn weights under which the expert-demonstrated trajectories are optimal. Unlike previous approaches [1], [2], our algorithm does not require repeatedly solving the forward problem (i.e., finding optimal trajectories under a candidate cost function). At the core of our approach is the PI² (Policy Improvement with Path Integrals) reinforcement learning algorithm [3], which optimizes a parameterized policy in continuous, high-dimensional spaces. PI² converges an order of magnitude faster than previous trajectory-based reinforcement learning algorithms on typical problems. We solve for the unknown cost function by enforcing the constraint that the expert-demonstrated trajectory does not change under the PI² update rule, and hence is locally optimal.
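To make the fixed-point idea concrete, the following is a minimal sketch (not the paper's implementation) of the core computation. It assumes a linear cost S_k = w·Φ_k over hypothetical per-rollout feature vectors Φ_k, the standard PI² softmax reweighting of sampled parameter perturbations, and a temperature λ; all names and dimensions are illustrative. The IRL condition described in the abstract corresponds to choosing w so that this update is (near) zero at the expert trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K noisy rollouts around an expert trajectory.
K, n_features, n_params = 10, 3, 5
Phi = rng.random((K, n_features))       # feature counts of each sampled rollout
eps = rng.standard_normal((K, n_params))  # parameter perturbations per rollout
lam = 1.0                               # PI^2 temperature

def pi2_update(w):
    """One PI^2-style parameter update under cost weights w.

    Rollout cost S_k = w . Phi[k]; rollouts are reweighted by the
    softmax exp(-S_k / lam), and the update is the probability-weighted
    mean of the perturbations eps_k."""
    S = Phi @ w
    P = np.exp(-(S - S.min()) / lam)    # shift by min for numerical stability
    P /= P.sum()
    return P @ eps                      # weighted average of the noise

# Learning the cost amounts to finding w for which the expert trajectory
# is a fixed point, i.e. this update vanishes.
w = np.ones(n_features)
print(np.linalg.norm(pi2_update(w)))
```

In this sketch the constraint "the expert trajectory does not change under the PI² update" becomes a system of equations in w, which is what removes the need to solve the forward problem for each candidate cost.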