Regularized Policy Iteration with Nonparametric Function Spaces

We study two regularization-based approximate policy iteration algorithms, REG-LSPI and REG-BRM, for solving reinforcement learning and planning problems in discounted Markov Decision Processes with large state spaces and finite action spaces. At the core of these algorithms are regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and thereby enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows how the loss depends on the number of samples, the capacity of the function space, and certain intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work to provide such a strong guarantee for a nonparametric approximate policy iteration algorithm.
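As a rough illustration of the kind of policy evaluation step described above, the sketch below solves an L2-regularized LSTD-Q problem with an explicit finite-dimensional feature map (for instance, a finite approximation of an RKHS feature map). This is a minimal, hypothetical sketch, not the paper's exact REG-LSPI estimator; the function name `reg_lstd_q` and the assumption of a fixed feature map are introduced only for illustration.

```python
import numpy as np

def reg_lstd_q(phi, phi_next, rewards, gamma, lam):
    """L2-regularized LSTD-Q solve (simplified, finite-dimensional sketch).

    phi      : (n, d) features of the sampled state-action pairs (x_i, a_i)
    phi_next : (n, d) features of (x'_i, pi(x'_i)) for the policy being evaluated
    rewards  : (n,) observed rewards
    gamma    : discount factor in [0, 1)
    lam      : regularization coefficient
    Returns the weight vector w of the approximate action-value function
    Q(x, a) = phi(x, a) @ w.
    """
    n, d = phi.shape
    # Regularized LSTD normal equations: (Phi^T (Phi - gamma Phi') + n*lam*I) w = Phi^T r
    A = phi.T @ (phi - gamma * phi_next) + n * lam * np.eye(d)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Toy usage with random data (purely illustrative):
rng = np.random.default_rng(0)
n, d = 200, 10
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.normal(size=n)
w = reg_lstd_q(phi, phi_next, rewards, gamma=0.95, lam=1e-2)
```

In an approximate policy iteration loop, such a policy evaluation solve would alternate with a greedy policy improvement step, with the next policy chosen greedily with respect to the estimated Q; the paper's RKHS formulation replaces the fixed feature map with a kernel and adds a regularizer in the RKHS norm.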
