Model Selection in Reinforcement Learning

We consider the problem of automating parameter tuning in reinforcement learning. More specifically, we consider the batch (off-line, non-interactive) reinforcement learning setting and the problem of learning an action-value function with a small Bellman error. We propose a model selection algorithm based on complexity regularization and prove its adaptivity: the procedure is shown to perform almost as well as if the best parameter setting were known ahead of time. We also discuss other approaches to deriving adaptive procedures in reinforcement learning.
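
As a rough illustration of what complexity regularization and adaptivity usually mean, the following is a generic sketch and not necessarily the exact procedure or bound of this paper. Suppose candidate action-value functions $\hat{Q}_1, \dots, \hat{Q}_K$ are obtained under $K$ parameter settings. The symbols $L$ (the true Bellman error), $\hat{L}_n$ (an estimate of it from $n$ samples), $\mathrm{pen}_n(k)$ (a complexity penalty for setting $k$), the constant $C$, and the remainder $\varepsilon_n$ are all assumptions introduced here for illustration.

```latex
% Selection rule: trade estimated Bellman error against model complexity.
\hat{k} \;=\; \operatorname*{arg\,min}_{1 \le k \le K}
  \Bigl\{ \hat{L}_n(\hat{Q}_k) + \mathrm{pen}_n(k) \Bigr\}

% Adaptivity, stated as an oracle-type inequality (in expectation or with
% high probability): the selected candidate is nearly as good as the best one.
L(\hat{Q}_{\hat{k}}) \;\le\;
  C \min_{1 \le k \le K} \Bigl\{ L(\hat{Q}_k) + \mathrm{pen}_n(k) \Bigr\}
  + \varepsilon_n
```

Read this way, "almost as well as if the best parameter setting were known" corresponds to the right-hand side: the minimum over $k$ is what an oracle that knew the best setting would achieve, up to the constant $C$ and a remainder $\varepsilon_n$ that shrinks as $n$ grows.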
