Bayesian Control of Large MDPs with Unknown Dynamics in Data-Poor Environments

We propose a Bayesian decision-making framework for control of Markov Decision Processes (MDPs) with unknown dynamics and large, possibly continuous, state, action, and parameter spaces in data-poor environments. Most existing adaptive controllers for MDPs with unknown dynamics are based on the reinforcement learning framework and rely on large data sets acquired by sustained direct interaction with the system or via a simulator, which is not feasible in many applications due to ethical, economic, and physical constraints. The proposed framework addresses the data-poverty issue by decomposing the problem into an offline planning stage, which does not rely on sustained direct interaction with the system or a simulator, and an online execution stage. In the offline stage, parallel Gaussian process temporal difference (GPTD) learning is employed for near-optimal Bayesian approximation of the expected discounted reward over a sample drawn from the prior distribution of the unknown parameters. In the online stage, the action with the maximum expected return with respect to the posterior distribution of the parameters is selected. This is achieved by approximating the posterior distribution with a Markov Chain Monte Carlo (MCMC) algorithm and then constructing multiple Gaussian processes over the parameter space to efficiently predict the mean expected return at each MCMC sample. The effectiveness of the proposed framework is demonstrated on a simple dynamical system with continuous state and action spaces, as well as on a more complex model of a metastatic melanoma gene regulatory network observed through noisy synthetic gene expression data.
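
The online decision rule sketched in the abstract can be illustrated with a short example. The code below is a minimal, hypothetical sketch rather than the paper's implementation: it assumes a finite set of candidate actions, uses scikit-learn's GaussianProcessRegressor as a stand-in for the Gaussian processes constructed over the parameter space, and all names (fit_value_gps, select_action, theta_samples, returns_per_action, posterior_samples) are illustrative. It fits one GP per action mapping parameter vectors to the offline GPTD estimate of expected discounted reward, then averages each GP's predicted mean over MCMC posterior samples and selects the maximizing action.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def fit_value_gps(theta_samples, returns_per_action):
    """Fit one GP per action over the parameter space.

    theta_samples: (n_prior_samples, d) parameters drawn from the prior,
        at which the offline GPTD stage estimated value functions.
    returns_per_action: dict mapping action -> (n_prior_samples,) array
        of estimated expected discounted rewards.
    """
    gps = {}
    for action, returns in returns_per_action.items():
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                      normalize_y=True)
        gp.fit(theta_samples, returns)
        gps[action] = gp
    return gps


def select_action(value_gps, posterior_samples):
    """Pick the action maximizing the posterior-averaged expected return.

    posterior_samples: (n_mcmc, d) parameter draws from an MCMC
        approximation of the posterior given the data observed so far.
    """
    best_action, best_value = None, -np.inf
    for action, gp in value_gps.items():
        # Predict the mean expected return at each MCMC sample and average:
        # a Monte Carlo estimate of the posterior expected return of this action.
        value = gp.predict(posterior_samples).mean()
        if value > best_value:
            best_action, best_value = action, value
    return best_action

In use, the placeholder inputs would come from the two stages described above: the offline GPTD runs supply theta_samples and returns_per_action, and the online MCMC sampler supplies posterior_samples; the averaging step is simply a Monte Carlo approximation of the posterior expected return for each candidate action.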
