Bo Dai | Ofir Nachum
[1] Masatoshi Uehara, et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation, 2019, ICML.
[2] David K. Smith, et al. Dynamic Programming and Optimal Control, Volume 1, 1996.
[3] Chris Watkins, et al. Learning from delayed rewards, 1989.
[4] Le Song, et al. Learning from Conditional Distributions via Dual Embeddings, 2016, AISTATS.
[5] J. Borwein, et al. Convex Analysis and Nonlinear Optimization, 2000.
[6] Mengdi Wang, et al. Randomized Linear Programming Solves the Discounted Markov Decision Problem in Nearly-Linear (Sometimes Sublinear) Running Time, 2017, arXiv:1704.01869.
[7] Bo Dai, et al. GenDICE: Generalized Offline Estimation of Stationary Values, 2020, ICLR.
[8] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.
[9] Heinz H. Bauschke, et al. What is... a Fenchel Conjugate, 2012.
[10] Gergely Neu, et al. Faster saddle-point optimization for solving large-scale Markov decision processes, 2020, L4DC.
[11] Yasemin Altun, et al. Relative Entropy Policy Search, 2010.
[12] Doina Precup, et al. A new Q(λ) with interim forward view and Monte Carlo equivalence, 2014, ICML.
[13] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.
[14] Marie Davidian, et al. Doubly robust estimation of causal effects, 2011, American Journal of Epidemiology.
[15] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.
[16] Le Song, et al. Boosting the Actor with Dual Critic, 2017, ICLR.
[17] Bruno Scherrer, et al. Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view, 2010, ICML.
[18] Bo Dai, et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections, 2019, NeurIPS.
[19] Sergey Levine, et al. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, 2019, arXiv.
[20] Stefano Ermon, et al. Generative Adversarial Imitation Learning, 2016, NIPS.
[21] Jan Peters, et al. f-Divergence constrained policy improvement, 2017, arXiv.
[22] Dimitri P. Bertsekas, et al. Nonlinear Programming, 1997.
[23] Siddhartha Srinivasa, et al. Imitation Learning as f-Divergence Minimization, 2019, WAFR.
[24] Yuval Tassa, et al. Maximum a Posteriori Policy Optimisation, 2018, ICLR.
[25] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[26] R. Tyrrell Rockafellar, et al. Convex Analysis, 1970, Princeton Landmarks in Mathematics and Physics.
[27] S. M. Ali, et al. A General Class of Coefficients of Divergence of One Distribution from Another, 1966.
[28] Lihong Li, et al. Scalable Bilinear π Learning Using State and Action Features, 2018, ICML.
[29] Richard Zemel, et al. A Divergence Minimization Perspective on Imitation Learning Methods, 2019, CoRL.
[30] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.
[31] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.
[32] Richard S. Sutton, et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation, 2008, NIPS.
[33] Vicenç Gómez, et al. A unified view of entropy-regularized Markov decision processes, 2017, arXiv.
[34] Michael Bowling, et al. Dual Representations for Dynamic Programming, 2008.
[35] Sebastian Nowozin, et al. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization, 2016, NIPS.
[36] D. Barrios-Aranibar, et al. Learning from Delayed Rewards Using Influence Values Applied to Coordination in Multi-Agent Systems, 2007.
[37] Masatoshi Uehara, et al. Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes, 2019, arXiv.
[38] Ilya Kostrikov, et al. Imitation Learning via Off-Policy Distribution Matching, 2019, ICLR.
[39] Benjamin Van Roy, et al. The Linear Programming Approach to Approximate Dynamic Programming, 2003, Oper. Res.
[40] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.
[41] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[42] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1996.
[43] Sergey Levine, et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning, 2017, ICLR.
[44] Masatoshi Uehara, et al. Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes, 2019, J. Mach. Learn. Res.
[45] E. Denardo. On Linear Programming in a Markov Decision Problem, 1970.
[46] Marek Petrik, et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms, 2015, UAI.
[47] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[48] Mengdi Wang, et al. Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning, 2016, arXiv.
[49] Simon S. Du, et al. Stochastic Variance Reduction Methods for Policy Evaluation, 2017.
[50] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.
[51] Qiang Liu, et al. Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation, 2019, ICLR.
[52] Le Song, et al. Exponential Family Estimation via Adversarial Dynamics Embedding, 2019, NeurIPS.
[53] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.
[54] A. S. Manne. Linear Programming and Sequential Decisions, 1960.
[55] R. Sutton, et al. A new Q(λ) with interim forward view and Monte Carlo equivalence, 2014.
[56] Csaba Szepesvári, et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, 2006, Machine Learning.
[57] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.
[58] Ilya Kostrikov, et al. AlgaeDICE: Policy Gradient from Arbitrary Experience, 2019, arXiv.
[59] H. Francis Song, et al. V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control, 2019, ICLR.