Online fitted policy iteration based on extreme learning machines

Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains limited for several reasons. Chief among these are the large amount of data the agent needs to learn useful policies and the poor scalability to high-dimensional problems caused by the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that addresses both issues. OFPI is based on a semi-batch scheme that increases convergence speed by reusing data and enables the use of global approximators by reformulating value function approximation as a standard supervised learning problem. The proposed method is empirically evaluated on three benchmark problems. In the experiments, OFPI employs a neural network trained with the extreme learning machine algorithm to approximate the value functions. The results demonstrate that OFPI remains stable with a global function approximator and that it outperforms two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network.
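
To make the data-reuse idea concrete, the sketch below is a minimal, illustrative Python example (not the authors' code) of the two ingredients named above: an extreme learning machine regressor whose output weights come from a single regularized least-squares solve, and a toy semi-batch step that turns stored transitions into a supervised regression problem whose targets bootstrap on the previous fit. All names, sizes, and constants (ELMRegressor, features, n_hidden, gamma, etc.) are assumptions introduced purely for illustration.

import numpy as np

# Single-hidden-layer extreme learning machine (ELM) regressor:
# input weights are random and fixed, output weights are obtained
# from a regularized least-squares solve, so "training" is one linear fit.
class ELMRegressor:
    def __init__(self, n_hidden=50, reg=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


# Toy semi-batch step on synthetic transitions (s, a, r, s_next):
# the stored data are recast as a supervised data set whose targets
# bootstrap on the previously fitted approximator.
rng = np.random.default_rng(1)
n, state_dim, n_actions, gamma = 200, 2, 3, 0.95
S = rng.normal(size=(n, state_dim))
A = rng.integers(0, n_actions, size=n)
R = rng.normal(size=n)
S_next = rng.normal(size=(n, state_dim))

def features(states, actions):
    # Simple state-action encoding: state concatenated with a one-hot action.
    one_hot = np.eye(n_actions)[actions]
    return np.hstack([states, one_hot])

# First fit against immediate rewards, then one bootstrapped refit.
q = ELMRegressor(n_hidden=40).fit(features(S, A), R)
q_next = np.column_stack([
    q.predict(features(S_next, np.full(n, a))) for a in range(n_actions)
])
targets = R + gamma * q_next.max(axis=1)
q = ELMRegressor(n_hidden=40).fit(features(S, A), targets)

Because only the output weights are fitted, each refit reduces to a linear solve, which is what makes repeatedly retraining a global approximator on the accumulated data cheap enough for an online, semi-batch scheme of this kind.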
