Reinforcement Learning via Fenchel-Rockafellar Duality

We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including the ability to perform policy evaluation and on-policy policy gradient with behavior-agnostic offline data and methods to learn a policy via max-likelihood optimization. Although many of these results have appeared previously in various forms, we provide a unified treatment and perspective on these results, which we hope will enable researchers to better use and apply the tools of convex duality to make further progress in RL.
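
For reference, a minimal statement of the duality underlying these derivations; the notation below (f^*, A^*) is generic convex-analysis notation and is not necessarily identical to the paper's. The Fenchel conjugate of a convex function f is

    f^*(y) = \sup_x \; \langle x, y \rangle - f(x),

and Fenchel-Rockafellar duality states that, for convex f and g and a linear operator A with adjoint A^*, under a suitable constraint qualification,

    \min_x \; f(x) + g(Ax) \;=\; \max_y \; -f^*(-A^* y) - g^*(y).

In the RL settings surveyed here, this allows a constrained or regularized primal objective (for example, a linear-programming formulation of policy evaluation) to be exchanged for an unconstrained dual problem over conjugate variables.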
