Reinforcement Learning via Fenchel-Rockafellar Duality

We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including the ability to perform policy evaluation and on-policy policy gradient with behavior-agnostic offline data and methods to learn a policy via max-likelihood optimization. Although many of these results have appeared previously in various forms, we provide a unified treatment and perspective on these results, which we hope will enable researchers to better use and apply the tools of convex duality to make further progress in RL.
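
For reference, a minimal statement of the duality underlying these derivations; the notation below (f^*, A^*) is generic convex-analysis notation and is not necessarily identical to the paper's. The Fenchel conjugate of a convex function f is

    f^*(y) = \sup_x \; \langle x, y \rangle - f(x),

and Fenchel-Rockafellar duality states that, for convex f and g and a linear operator A with adjoint A^*, under a suitable constraint qualification,

    \min_x \; f(x) + g(Ax) \;=\; \max_y \; -f^*(-A^* y) - g^*(y).

In the RL settings surveyed here, this allows a constrained or regularized primal objective (for example, a linear-programming formulation of policy evaluation) to be exchanged for an unconstrained dual problem over conjugate variables.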
