Consistent On-Line Off-Policy Evaluation

The problem of on-line off-policy evaluation (OPE) has been actively studied over the last decade due to its importance both as a stand-alone problem and as a module in policy improvement schemes. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distributions of the behavior and target policies, and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference algorithm, COP-TD($\lambda$, $\beta$), which addresses this issue and reduces the resulting bias at some computational expense. We show that COP-TD($\lambda$, $\beta$) can be designed to converge to the same value that would have been obtained by running on-policy TD($\lambda$) with the target policy. The proposed scheme further leads to a related and promising heuristic we call log-COP-TD($\lambda$, $\beta$). Both algorithms compare favorably in our experiments with current state-of-the-art on-line OPE algorithms. Finally, our formulation sheds new light on the recently proposed Emphatic TD learning.
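To make the stationary-distribution issue concrete, below is a minimal sketch of a distribution-ratio-corrected linear TD(0) update driven by behavior-policy data. It illustrates the general idea only and is not the paper's COP-TD($\lambda$, $\beta$) update: the function names, the dictionary-based ratio estimate `c`, its stochastic-approximation update rule, and the step-size parameters are all assumptions made for this sketch.

```python
import numpy as np

def ratio_corrected_td0(transitions, phi, pi, mu,
                        alpha=0.05, alpha_c=0.05, gamma=0.95):
    """Sketch: off-policy linear TD(0) reweighted by an estimated
    state-distribution ratio d_pi(s) / d_mu(s).

    transitions : list of (s, a, r, s_next) tuples generated by the
                  behavior policy mu; states must be hashable.
    phi(s)      : feature vector of state s (numpy array).
    pi(a, s), mu(a, s) : target / behavior action probabilities.
    """
    d = phi(transitions[0][0]).shape[0]
    w = np.zeros(d)      # value-function weights
    c = {}               # per-state estimate of d_pi(s) / d_mu(s)

    for (s, a, r, s_next) in transitions:
        rho = pi(a, s) / mu(a, s)        # per-step importance ratio
        c_s = c.get(s, 1.0)

        # Crude stochastic-approximation update of the covariate-shift
        # ratio: push c(s') toward rho * c(s), so products of per-step
        # ratios along the behavior trajectory track d_pi / d_mu
        # (an assumption of this sketch, not the paper's rule).
        c_next = c.get(s_next, 1.0)
        c[s_next] = c_next + alpha_c * (rho * c_s - c_next)

        # TD(0) update reweighted by both the action ratio and the
        # state-distribution ratio, so the effective sampling
        # distribution matches that of the target policy.
        delta = r + gamma * np.dot(w, phi(s_next)) - np.dot(w, phi(s))
        w += alpha * c_s * rho * delta * phi(s)

    return w
```

The point the abstract makes is visible in the last update: without the state-ratio factor (`c_s` here), the fixed point of off-policy TD with linear function approximation is weighted by the behavior policy's stationary distribution rather than the target policy's, which is exactly the bias COP-TD($\lambda$, $\beta$) is designed to remove.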
