Factorized Asymptotic Bayesian Policy Search for POMDPs

This paper proposes a novel direct policy search (DPS) method with model selection for partially observable Markov decision processes (POMDPs). DPS methods have become standard for learning in POMDPs because of their computational efficiency and their natural fit with total-reward maximization. An important open challenge for the effective use of DPS is model selection, i.e., determining the proper dimensionality of the hidden states and the complexity of the policy function, so as to mitigate overfitting in highly flexible model representations of POMDPs. This paper bridges Bayesian inference and reward maximization by deriving a marginalized weighted log-likelihood (MWL) for POMDPs that combines the advantages of Bayesian model selection and DPS. We then propose factorized asymptotic Bayesian policy search (FABPS), which explores the model and the policy that maximize the MWL by extending recently developed factorized asymptotic Bayesian inference. Experimental results show that FABPS outperforms state-of-the-art model selection methods for POMDPs, with respect both to model selection and to expected total reward.
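
To make the objective concrete, here is a minimal sketch of the reward-weighted marginal-likelihood idea that an MWL-style criterion builds on. This is an illustrative reconstruction under assumed notation (the symbols tau, z, R, theta, and M below are not taken from the paper), following the standard EM-based policy-search formulation in which a non-negative total reward is treated as an unnormalized likelihood and the hidden-state sequence is marginalized out:

\[
  \mathcal{L}(\theta; M) \;=\; \log \int R(\tau)\, p(\tau, z \mid \theta, M)\, \mathrm{d}z\, \mathrm{d}\tau ,
\]

where \(\tau\) is an observation-action trajectory, \(z\) is the hidden-state sequence, \(R(\tau) \ge 0\) is the total reward, \(\theta\) collects the POMDP and policy parameters, and \(M\) indexes the model (hidden-state dimensionality and policy-function complexity). A factorized asymptotic Bayesian treatment would additionally marginalize \(\theta\) with a Laplace-style approximation, yielding a BIC-like complexity penalty; read this way, maximizing such a reward-weighted marginal likelihood is what allows model selection and reward maximization to be performed jointly.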
