Incorporating Domain Models into Bayesian Optimization for RL

In many Reinforcement Learning (RL) domains, generating the experience needed to evaluate an agent's performance is expensive. An appealing approach to reducing the number of costly evaluations is Bayesian Optimization (BO), a framework for global optimization of noisy, expensive-to-evaluate functions. Prior work in a number of RL domains has demonstrated the effectiveness of BO for optimizing parametric policies. However, those approaches consider only the total reward achieved by each policy execution and completely ignore the observed state-transition sequence. In this paper, we study how to incorporate all of the information observed during policy executions into the BO framework more effectively. In particular, our approach uses the observed data to learn approximate transition models that allow Monte-Carlo predictions of policy returns. These models are then incorporated into the BO framework as a prior on policy returns, which better informs the BO process. The resulting algorithm provides a new way to leverage learned models in RL even when no planner is available for exploiting them. We demonstrate the effectiveness of our algorithm in four benchmark domains with dynamics of varying complexity. Results indicate that our algorithm effectively combines model-based predictions to improve the data efficiency of model-free BO methods, and that it is robust to modeling errors when parts of the domain cannot be modeled successfully.
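To make the central idea concrete, the sketch below shows one way model-based predictions can enter BO as a prior on policy returns: a Gaussian process over policy parameters uses the model's Monte-Carlo return estimate as its prior mean, so the GP only needs to learn the model's error from real evaluations, and an expected-improvement acquisition picks the next policy to run. This is a minimal illustration under stated assumptions, not the paper's algorithm: the 1-D toy objective, the `model_return` stand-in (which fakes a biased Monte-Carlo rollout estimate rather than rolling out a learned transition model), and all hyperparameters are hypothetical.

```python
import numpy as np
from math import erf

def rbf_kernel(A, B, ls=0.3, var=1.0):
    """Squared-exponential covariance between 1-D policy parameters."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / ls ** 2)

def true_return(theta):
    """Hypothetical expensive policy evaluation (the black box BO optimizes)."""
    return np.sin(3 * theta) + 0.5 * theta

def model_return(theta, n_rollouts=20):
    """Stand-in for Monte-Carlo rollouts of a learned transition model:
    a biased, slightly noisy approximation of the true return."""
    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 0.05, size=(n_rollouts,) + np.shape(theta))
    return true_return(theta) + 0.3 + noise.mean(axis=0)  # +0.3 = model bias

def gp_posterior(X, y, Xs, prior_mean, noise=1e-4):
    """GP posterior over returns; the model-based prediction is the prior
    mean, so the GP fits only the model's *error* from real data."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xs, X)
    alpha = np.linalg.solve(K, y - prior_mean(X))
    mu = prior_mean(Xs) + Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    var = np.clip(1.0 - np.sum(Ks * v.T, axis=1), 1e-12, None)  # prior var = 1
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Standard EI acquisition for choosing the next policy to evaluate."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (mu - best) * cdf + sigma * pdf

rng = np.random.default_rng(1)
grid = np.linspace(-2.0, 2.0, 201)       # candidate policy parameters
X = np.array([-1.5, -0.5])               # initial expensive evaluations
y = true_return(X) + rng.normal(0.0, 0.01, X.shape)
for _ in range(5):                       # BO loop: 5 more expensive runs
    mu, sd = gp_posterior(X, y, grid, model_return)
    ei = expected_improvement(mu, sd, y.max())
    theta_next = grid[np.argmax(ei)]
    X = np.append(X, theta_next)
    y = np.append(y, true_return(theta_next) + rng.normal(0.0, 0.01))
print("best return found:", y.max())
```

Because the prior mean already tracks the objective's shape away from the data, the acquisition is drawn toward the model-predicted optimum after very few real evaluations; where the model is wrong, accumulated observations override the prior, which is the robustness property the abstract describes.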
