Lipschitz Continuity in Model-based Reinforcement Learning

Model-based reinforcement-learning methods learn transition and reward models and use them to guide behavior. We analyze the impact of learning models that are Lipschitz continuous, meaning the distance between the function's values at two inputs is bounded by a linear function of the distance between the inputs themselves. Our first result establishes a tight bound on multi-step prediction error for Lipschitz-continuous models. We then prove an error bound on the value-function estimate arising from such models and show that the estimated value function is itself Lipschitz continuous. We conclude with empirical results demonstrating significant benefits from enforcing Lipschitz continuity of neural network models during reinforcement learning.
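
To make the definition above concrete (a standard restatement; the symbols K, d, s_1, and s_2 are generic notation introduced here, not taken from the paper), a function f is Lipschitz continuous if there exists a constant K >= 0 such that, for all inputs s_1 and s_2,

\[ d\big(f(s_1), f(s_2)\big) \le K \, d(s_1, s_2), \]

where d is the distance on the relevant space (strictly, one metric on inputs and one on outputs). The smallest such K is the Lipschitz constant of f.

As an illustration of enforcing such a constraint on a neural network model, the following is a minimal PyTorch sketch that applies spectral normalization to each linear layer. Spectral normalization is one standard way to bound a network's Lipschitz constant; it is an assumption made here for illustration, not necessarily the mechanism used in the paper.

import torch.nn as nn
from torch.nn.utils import spectral_norm

def lipschitz_mlp(in_dim, hidden_dim, out_dim):
    # Spectral normalization rescales each weight matrix so that its largest
    # singular value is approximately 1; since ReLU is 1-Lipschitz, the
    # composed network is then approximately 1-Lipschitz end to end.
    return nn.Sequential(
        spectral_norm(nn.Linear(in_dim, hidden_dim)),
        nn.ReLU(),
        spectral_norm(nn.Linear(hidden_dim, out_dim)),
    )

# Example: a transition model mapping a (state, action) vector to a
# next-state prediction; the dimensions here are placeholders.
transition_model = lipschitz_mlp(in_dim=6, hidden_dim=64, out_dim=4)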
