Boosted Fitted Q-Iteration

This paper studies B-FQI, an Approximate Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems. B-FQI is an iterative, offline algorithm that, given a dataset of transitions, builds an approximation of the optimal action-value function by summing the approximations of the Bellman residuals across iterations. This approach has two advantages over other AVI methods: (1) while keeping the same function space at each iteration, B-FQI can represent more complex functions by considering an additive model; (2) since the Bellman residual decreases as the optimal value function is approached, the regression problems become easier as iterations proceed. We study B-FQI both theoretically, providing a finite-sample error upper bound, and empirically, by comparing its performance to that of FQI on different domains and with different regression techniques.
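To make the additive construction concrete: at iteration k, B-FQI fits a regressor rho_k to the empirical Bellman residual T*Q_{k-1} - Q_{k-1} and sets Q_k = Q_{k-1} + rho_k, so the final estimate is the sum of all fitted residuals. The following is a minimal sketch of this loop under stated assumptions, not the authors' implementation: it assumes a finite action set, transitions given as (state, action, reward, next state) tuples, and a scikit-learn extra-trees regressor as the base learner; the names boosted_fqi and q_value are illustrative.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def boosted_fqi(transitions, actions, n_iterations, gamma=0.99):
    # transitions: list of (state, action, reward, next_state), with states
    # as 1-D numpy arrays and actions as scalars from the finite set `actions`.
    residual_models = []  # additive model: Q_K(s, a) = sum_k rho_k(s, a)

    def q_value(states, acts):
        # Evaluate the current additive approximation of Q at (states, acts).
        if not residual_models:
            return np.zeros(len(states))
        X = np.column_stack([states, acts])
        return sum(m.predict(X) for m in residual_models)

    s = np.array([t[0] for t in transitions])
    a = np.array([t[1] for t in transitions])
    r = np.array([t[2] for t in transitions])
    s_next = np.array([t[3] for t in transitions])
    X = np.column_stack([s, a])

    for _ in range(n_iterations):
        # Greedy value of the next state under the current additive model.
        q_next = np.max(
            np.column_stack([q_value(s_next, np.full(len(s_next), act))
                             for act in actions]),
            axis=1)
        # Empirical Bellman residual target: (T* Q_{k-1})(s, a) - Q_{k-1}(s, a).
        targets = r + gamma * q_next - q_value(s, a)
        # rho_k approximates the residual; Q_k = Q_{k-1} + rho_k.
        residual_models.append(ExtraTreesRegressor(n_estimators=50).fit(X, targets))

    return q_value  # callable approximation of the optimal Q-function

By contrast, standard FQI refits a single regressor to the full targets r + gamma * max_a' Q_{k-1}(s', a') at every iteration, whereas here each new regressor only has to capture the residual, which shrinks as the iterates approach the optimal value function.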
