Regularized Policy Iteration with Nonparametric Function Spaces

We study two regularization-based approximate policy iteration algorithms, REG-LSPI and REG-BRM, for solving reinforcement learning and planning problems in discounted Markov Decision Processes with large state spaces and finite action spaces. At the core of these algorithms are regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and thereby enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows how the loss depends on the number of samples, the capacity of the function space, and certain intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work to provide such a strong guarantee for a nonparametric approximate policy iteration algorithm.
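As a rough illustration of the kind of policy evaluation step described above, the sketch below solves an L2-regularized LSTD-Q problem with an explicit finite-dimensional feature map (for instance, a finite approximation of an RKHS feature map). This is a minimal, hypothetical sketch, not the paper's exact REG-LSPI estimator; the function name `reg_lstd_q` and the assumption of a fixed feature map are introduced only for illustration.

```python
import numpy as np

def reg_lstd_q(phi, phi_next, rewards, gamma, lam):
    """L2-regularized LSTD-Q solve (simplified, finite-dimensional sketch).

    phi      : (n, d) features of the sampled state-action pairs (x_i, a_i)
    phi_next : (n, d) features of (x'_i, pi(x'_i)) for the policy being evaluated
    rewards  : (n,) observed rewards
    gamma    : discount factor in [0, 1)
    lam      : regularization coefficient
    Returns the weight vector w of the approximate action-value function
    Q(x, a) = phi(x, a) @ w.
    """
    n, d = phi.shape
    # Regularized LSTD normal equations: (Phi^T (Phi - gamma Phi') + n*lam*I) w = Phi^T r
    A = phi.T @ (phi - gamma * phi_next) + n * lam * np.eye(d)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Toy usage with random data (purely illustrative):
rng = np.random.default_rng(0)
n, d = 200, 10
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.normal(size=n)
w = reg_lstd_q(phi, phi_next, rewards, gamma=0.95, lam=1e-2)
```

In an approximate policy iteration loop, such a policy evaluation solve would alternate with a greedy policy improvement step, with the next policy chosen greedily with respect to the estimated Q; the paper's RKHS formulation replaces the fixed feature map with a kernel and adds a regularizer in the RKHS norm.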
