Infinite time horizon maximum causal entropy inverse reinforcement learning

We extend the maximum causal entropy framework for inverse reinforcement learning to the infinite-horizon, discounted-reward setting. To do so, we maximize the discounted future contributions to causal entropy subject to a discounted feature expectation matching constraint. A parameterized class of stochastic policies that solve this problem is referred to as the class of soft Bellman policies, because these policies can be specified in terms of values that satisfy an equation identical to the Bellman equation except that a softmax (the log of a sum of exponentials) replaces the max. Under some assumptions, algorithms that repeatedly solve for a soft Bellman policy, evaluate that policy, and then perform a gradient update on the parameters will find the optimal soft Bellman policy. For the first step, we extend techniques from dynamic programming and reinforcement learning so that they derive soft Bellman policies. For the second step, we can use policy evaluation techniques from dynamic programming or perform Monte Carlo simulations. We compare three algorithms of this type by applying them to a problem instance involving demonstration data from a simple controlled queuing network model inspired by problems in air traffic management.
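As a sketch of the recursion described above, the soft Bellman equation for a linearly parameterized reward \theta^\top f(s,a), discount \gamma, and transition kernel P can be written as follows (the notation here is ours and is only illustrative, not the paper's exact formulation):

\begin{aligned}
Q_\theta(s,a) &= \theta^\top f(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_\theta(s'), \\
V_\theta(s)   &= \log \sum_{a} \exp Q_\theta(s,a), \\
\pi_\theta(a \mid s) &= \exp\bigl(Q_\theta(s,a) - V_\theta(s)\bigr).
\end{aligned}

The only change relative to the ordinary Bellman optimality equation is that the max over actions is replaced by the log-sum-exp, which is what makes the resulting policy stochastic.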

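Below is a minimal Python sketch of the solve-evaluate-update scheme the abstract outlines, assuming a finite state-action space, a linearly parameterized reward feats @ theta, and exact (dynamic programming) policy evaluation. Function names, array shapes, the learning rate, and iteration counts are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, gamma, n_iter=2000, tol=1e-8):
    """Bellman backups with log-sum-exp in place of max; returns soft values and policy.

    P: transitions, shape (S, A, S); r: rewards, shape (S, A); gamma: discount in (0, 1).
    """
    V = np.zeros(r.shape[0])
    for _ in range(n_iter):
        Q = r + gamma * (P @ V)            # soft state-action values, shape (S, A)
        V_new = logsumexp(Q, axis=1)       # softmax replaces the hard max
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = r + gamma * (P @ V)
    return V, np.exp(Q - V[:, None])       # stochastic soft Bellman policy, shape (S, A)

def discounted_feature_expectations(P, pi, feats, p0, gamma, horizon=2000):
    """Exact (DP) evaluation of the policy's discounted feature expectations."""
    d, disc, mu = p0.copy(), 1.0, np.zeros(feats.shape[-1])
    for _ in range(horizon):
        sa = d[:, None] * pi                           # state-action distribution at time t
        mu += disc * np.einsum('sa,sak->k', sa, feats)
        d = np.einsum('sa,saq->q', sa, P)              # propagate to the next state distribution
        disc *= gamma
    return mu

def max_causal_ent_irl(P, feats, p0, gamma, mu_demo, lr=0.1, n_outer=200):
    """Repeat: solve for the soft Bellman policy, evaluate it, take a gradient step on theta."""
    theta = np.zeros(feats.shape[-1])
    for _ in range(n_outer):
        _, pi = soft_value_iteration(P, feats @ theta, gamma)
        mu_pi = discounted_feature_expectations(P, pi, feats, p0, gamma)
        theta += lr * (mu_demo - mu_pi)    # gradient: demonstrated minus policy feature expectations
    return theta

Here mu_demo stands for the discounted empirical feature expectations estimated from the demonstrations. The Monte Carlo variant mentioned in the abstract would replace discounted_feature_expectations with sample rollouts of pi.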