A maximum-entropy approach to off-policy evaluation in average-reward MDPs

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, where the feature dynamics are only approximately linear and the rewards are arbitrary, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under the empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
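To make the formulation concrete, here is a minimal sketch in assumed notation (chosen for illustration, not taken verbatim from the paper): let d denote a candidate state distribution, \phi the known feature map, H(d) the entropy of d, and \hat{\mathbb{E}} an expectation under the empirical dynamics. The stationary-distribution estimate solves the convex program

\max_{d \in \Delta(\mathcal{S})} \; H(d)
\qquad \text{s.t.} \qquad
\mathbb{E}_{s \sim d}\big[\hat{\mathbb{E}}[\phi(s') \mid s]\big] = \mathbb{E}_{s \sim d}\big[\phi(s)\big],

i.e., d must look stationary as measured through the features. When the feature dynamics are (approximately) linear, \hat{\mathbb{E}}[\phi(s') \mid s] \approx \hat{M}\phi(s) for some matrix \hat{M}, the constraint becomes (\hat{M} - I)\,\mathbb{E}_{s \sim d}[\phi(s)] = 0, and Lagrangian duality yields an exponential-family solution with the features as sufficient statistics,

d_\theta(s) \;\propto\; \exp\!\big(\theta^\top \phi(s)\big),

where \theta collects the Lagrange multipliers composed with the feature-dynamics map.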
