Classification-based Approximate Policy Iteration: Experiments and Extended Discussions

Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem at hand. Most current methods are geared towards exploiting the regularities of either the value function or the policy. We introduce a general classification-based approximate policy iteration (CAPI) framework, which encompasses a large class of algorithms that can exploit regularities of both the value function and the policy space, depending on what is advantageous. This framework has two main components: a generic value function estimator and a classifier that learns a policy based on the estimated value function. We establish theoretical guarantees for the sample complexity of CAPI-style algorithms, which allow the policy evaluation step to be performed by a wide variety of algorithms (including temporal-difference-style methods), and can handle nonparametric representations of policies. Our bounds on the estimation error of the performance loss are tighter than existing results. We also illustrate this approach empirically on several problems, including a large HIV control task.
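
To make the two components concrete, the following is a minimal sketch of a single CAPI-style improvement step: given value estimates produced by any policy-evaluation routine, a classifier is fit to the greedy actions at a set of sampled states, with each state weighted by its estimated action gap. The helper names, the choice of a decision-tree classifier, and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of one CAPI-style iteration, assuming a pluggable policy
# evaluator.  The function name, the classifier, and the toy data below are
# illustrative assumptions, not the algorithm as specified in the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def capi_iteration(states, q_values):
    """One approximate policy improvement step.

    states   : (n, d) array of sampled states.
    q_values : (n, k) array, q_values[i, a] = estimated Q^pi(states[i], a),
               produced by any policy-evaluation method (regression, TD, ...).
    Returns a classifier whose predict() acts as the improved (greedy) policy.
    """
    greedy_actions = q_values.argmax(axis=1)
    # Weight each state by its estimated action gap: states where picking a
    # suboptimal action is costly influence the classifier the most.
    sorted_q = np.sort(q_values, axis=1)
    action_gaps = sorted_q[:, -1] - sorted_q[:, -2]
    policy = DecisionTreeClassifier(max_depth=5)
    policy.fit(states, greedy_actions, sample_weight=action_gaps)
    return policy


# Toy usage: random states and Q-estimates stand in for a real evaluator.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))       # 200 sampled states, 4 features
q_hat = rng.normal(size=(200, 3))        # value estimates for 3 actions
next_policy = capi_iteration(states, q_hat)
print(next_policy.predict(states[:5]))   # greedy actions at a few states
```

Any classifier admitting weighted examples could play the same role; the action-gap weighting simply reflects that misclassifying a state where all actions have nearly equal value costs little in performance.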
