Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning

The choice of a system's control frequency has a significant impact on the ability of reinforcement learning algorithms to learn a high-performing policy. In this paper, we introduce the notion of action persistence, which consists in repeating an action for a fixed number of decision steps and has the effect of modifying the control frequency. We first analyze how action persistence affects the performance of the optimal policy, and then present a novel algorithm, Persistent Fitted Q-Iteration (PFQI), which extends FQI with the goal of learning the optimal value function at a given persistence. After providing a theoretical study of PFQI and a heuristic approach for identifying the optimal persistence, we present an experimental campaign on benchmark domains to show the advantages of action persistence and prove the effectiveness of our persistence selection method.
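To make the notion of action persistence concrete, the following minimal sketch (in Python, assuming a gym-style step interface) wraps a base environment so that each chosen action is repeated for k decision steps, with the discounted reward accumulated over the repetition. The class and environment names (`PersistentEnv`, `ToyChain`) are hypothetical and introduced only for illustration; this is not the paper's PFQI implementation.

```python
import numpy as np


class PersistentEnv:
    """Wrapper that repeats each action for k consecutive base decision
    steps, which effectively lowers the control frequency."""

    def __init__(self, env, persistence, gamma=0.99):
        self.env = env
        self.k = persistence
        self.gamma = gamma

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Persist the same action for k base steps and accumulate the
        # discounted reward, so one call behaves like a single decision
        # step taken at persistence k.
        total_reward, done, state = 0.0, False, None
        for i in range(self.k):
            state, reward, done = self.env.step(action)
            total_reward += (self.gamma ** i) * reward
            if done:
                break
        return state, total_reward, done


class ToyChain:
    """Minimal deterministic chain environment, used only to exercise
    the wrapper (hypothetical, not from the paper)."""

    def __init__(self, length=10):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action in {-1, +1}: move along the chain; reward 1 at the goal.
        self.pos = int(np.clip(self.pos + action, 0, self.length - 1))
        reward = 1.0 if self.pos == self.length - 1 else 0.0
        done = self.pos == self.length - 1
        return self.pos, reward, done


if __name__ == "__main__":
    env = PersistentEnv(ToyChain(length=10), persistence=3)
    state, done, steps = env.reset(), False, 0
    while not done:
        state, reward, done = env.step(+1)  # the action is persisted for k steps
        steps += 1
    print(f"goal reached in {steps} persistent decision steps")
```

A batch algorithm such as FQI can then be run on transitions collected through such a wrapper to learn a value function at the chosen persistence, with larger k trading reactivity for simpler, longer-horizon transitions.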
