Value-Aware Loss Function for Model-based Reinforcement Learning

We consider the problem of estimating the transition probability kernel to be used by a model-based reinforcement learning (RL) algorithm. We argue that estimating a generative model that minimizes a probabilistic loss, such as the log-loss, is overkill because it does not take into account the underlying structure of the decision problem and the RL algorithm that intends to solve it. We introduce a loss function that takes the structure of the value function into account. We provide a finite-sample upper bound for the loss function showing the dependence of the error on the model approximation error, the number of samples, and the complexity of the model space. We also empirically compare the method with the maximum likelihood estimator on a simple problem.
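
To make the contrast with a probabilistic fit concrete, here is a minimal tabular sketch, not the paper's algorithm or notation: it selects a model from a small candidate class either by log-loss or by a value-aware criterion that only penalizes errors in the expected next-state value, taken in the worst case over a value-function class. The candidate models, the class F, and the use of the empirical distribution as reference are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's method): compare model selection
# by log-loss with selection by a value-aware loss on a single (state, action) pair.

rng = np.random.default_rng(0)

p_true = np.array([0.5, 0.3, 0.1, 0.1])      # true next-state distribution P(.|s,a)
counts = rng.multinomial(30, p_true)          # observed transition counts from (s,a)
p_emp = counts / counts.sum()                 # empirical next-state distribution

# A restricted model class: a few hand-picked candidate distributions (assumption).
candidates = [
    np.array([0.25, 0.25, 0.25, 0.25]),
    np.array([0.55, 0.25, 0.10, 0.10]),
    np.array([0.40, 0.40, 0.10, 0.10]),
]

def log_loss(p_hat):
    """Negative log-likelihood of the observed transition counts under the model."""
    return -np.sum(counts * np.log(p_hat + 1e-12))

# Value functions the planner might use; the value-aware criterion only cares about
# errors in the expected next-state value, worst case over this class F (assumption).
F = [np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.0, 1.0, 0.0, 0.0])]

def value_aware_loss(p_hat):
    """max over V in F of |(P_emp - P_hat) V|, with the empirical P as reference."""
    return max(abs((p_emp - p_hat) @ V) for V in F)

best_by_log_loss = min(candidates, key=log_loss)
best_by_value_aware = min(candidates, key=value_aware_loss)
print("model chosen by log-loss:        ", best_by_log_loss)
print("model chosen by value-aware loss:", best_by_value_aware)
```

The two criteria can pick different models from the same class: the log-loss rewards matching the whole distribution, while the value-aware loss tolerates distributional error in directions that leave the Bellman backup of every value function in F unchanged.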
