Structure Learning for Safe Policy Improvement

We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when that structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve its performance. However, current SPI algorithms impose requirements that can be impractical, such as (i) the availability of a large amount of historical data or (ii) prior knowledge of the underlying structure. To overcome these limitations, we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and rely on weaker prior-knowledge assumptions. In well-factorized domains, the proposed algorithms significantly outperform a flat SPI algorithm, with sample complexity approaching that of an FSPI algorithm given the true structure. This indicates that combining FSPI with structure learning is a promising approach to real-world problems involving many variables.

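The approach can be pictured as a two-stage pipeline: first learn a dynamic Bayesian network (DBN) structure for the factored MDP from batch data, then run a factored SPI step whose model estimates exploit the learned parent sets. The Python sketch below is a minimal illustration under assumed details: bic_score, learn_parents, and the exhaustive per-variable parent search are generic stand-ins, not the specific structure-learning methods evaluated in the paper.

# Illustrative sketch (not the paper's code): BIC-based DBN structure
# learning for one action's transition model in a factored MDP.
import itertools
import math
import random
from collections import defaultdict

def bic_score(data, child, parents):
    """BIC score for predicting next-step variable `child` from the
    current-step variables in `parents`; `data` holds (s, s') pairs of
    discrete-valued state tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, s_next in data:
        key = tuple(s[p] for p in parents)
        counts[key][s_next[child]] += 1
    loglik, n_params = 0.0, 0
    for dist in counts.values():
        total = sum(dist.values())
        n_params += max(len(dist) - 1, 1)
        for c in dist.values():
            loglik += c * math.log(c / total)
    return loglik - 0.5 * n_params * math.log(len(data))

def learn_parents(data, n_vars, max_parents=2):
    """Pick, for each next-step variable, the parent set with the best
    BIC score (exhaustive up to `max_parents`; a greedy search would be
    used at scale)."""
    return {
        child: max(
            (list(ps)
             for k in range(max_parents + 1)
             for ps in itertools.combinations(range(n_vars), k)),
            key=lambda ps: bic_score(data, child, ps),
        )
        for child in range(n_vars)
    }

# Toy check: variable 0 depends only on itself; variable 1 on both.
random.seed(0)
data, s = [], (0, 0)
for _ in range(2000):
    s_next = ((s[0] + (random.random() < 0.2)) % 2,
              (s[0] + s[1] + (random.random() < 0.1)) % 2)
    data.append((s, s_next))
    s = s_next
print(learn_parents(data, n_vars=2))  # expect roughly {0: [0], 1: [0, 1]}

Once parent sets are fixed, a factored SPI algorithm can estimate each conditional probability table from far fewer samples than a flat model needs, and can restrict deviations from the baseline policy to state-action pairs whose parent-configuration counts clear a confidence threshold; the details of that safe-improvement step follow the paper, not this sketch.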