Online model-learning algorithm from samples and trajectories

Learning the value function and the policy in continuous MDPs is non-trivial because of the difficulty of collecting enough data. Model learning uses the collected data more effectively: a model is learned from the data and then used for planning, which accelerates the learning of the value function and the policy. Most existing work on model learning focuses on improving either single-step or multi-step prediction, whereas combining the two may be a better choice. We therefore propose an online algorithm, called Online-ML-ST, in which the model is learned both from individual samples and from whole trajectories. Unlike existing work, the trajectories collected during interaction with the environment are used not only to learn the model offline, but also to learn the model, the value function, and the policy online. Experiments on two typical continuous benchmarks, Pole Balancing and Inverted Pendulum, show that Online-ML-ST outperforms three other typical methods in learning speed and convergence rate.
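For concreteness, the sketch below illustrates one way a sample-and-trajectory model update and a Dyna-style planning step could be organized. It is a minimal sketch under assumed interfaces: the linear transition model and all names (LinearModel, planning_updates, value_update, and so on) are placeholders chosen for illustration, not the paper's Online-ML-ST algorithm.

```python
import numpy as np

# Illustrative sketch only: fit a transition model online from both
# single-step samples and whole trajectories, then use it for
# Dyna-style planning. The linear model is a stand-in; none of these
# names come from the paper.

class LinearModel:
    """One-step model s' ~= W @ [s; a; 1], trained online by gradient steps."""

    def __init__(self, state_dim, action_dim, lr=0.05):
        self.W = np.zeros((state_dim, state_dim + action_dim + 1))
        self.lr = lr

    def _features(self, s, a):
        return np.concatenate([s, a, [1.0]])

    def predict(self, s, a):
        return self.W @ self._features(s, a)

    def update_from_sample(self, s, a, s_next):
        # Single-step learning: reduce the one-step prediction error.
        phi = self._features(s, a)
        err = s_next - self.W @ phi
        self.W += self.lr * np.outer(err, phi)

    def update_from_trajectory(self, trajectory):
        # Multi-step learning: roll the model forward along a stored
        # trajectory [(s_0, a_0, s_1), (s_1, a_1, s_2), ...] and correct
        # each prediction toward the observed state, so that errors do
        # not compound when the model is later used for planning.
        s_hat = trajectory[0][0]
        for s, a, s_next in trajectory:
            phi = self._features(s_hat, a)
            err = s_next - self.W @ phi
            self.W += self.lr * np.outer(err, phi)
            s_hat = self.W @ phi  # propagate the model's own prediction


def planning_updates(model, policy, value_update, visited_states,
                     n_plan=10, rng=None):
    # Dyna-style planning: sample previously visited states, simulate one
    # step with the learned model, and apply the same update rule used for
    # real transitions to the value function / policy.
    rng = rng or np.random.default_rng()
    for _ in range(n_plan):
        s = visited_states[rng.integers(len(visited_states))]
        a = policy(s)
        s_next = model.predict(s, a)
        value_update(s, a, s_next)
```

In this sketch the single-step update only ever sees real transitions, while the trajectory update rolls the model forward on its own predictions; together they target both one-step and multi-step prediction error, which is the combination the abstract argues for.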
