Robust partially observable Markov decision process

We seek the robust policy that maximizes the expected cumulative reward in the worst case when a partially observable Markov decision process (POMDP) has uncertain parameters whose values are only known to lie in a given region. We prove that the robust value function, which represents the expected cumulative reward obtainable with the robust policy, is convex with respect to the belief state. Based on this convexity, we design a value-iteration algorithm for finding the robust policy, and we prove that this value iteration converges for an infinite horizon. We also design a point-based value iteration for finding the robust policy more efficiently, possibly with approximation. Numerical experiments show that our point-based value iteration can adequately find robust policies.
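The robust point-based value iteration described above can be sketched in code. The snippet below is an illustrative toy, not the paper's algorithm: it uses a hypothetical two-state, two-action, two-observation POMDP, represents parameter uncertainty as a finite set of candidate models, and performs a point-based backup that takes the worst case over models inside the usual best case over actions. All numerical values and the `make_model` parameterization are assumptions for illustration.

```python
GAMMA = 0.9
S, A, O = (0, 1), (0, 1), (0, 1)  # states, actions, observations

R = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # R[a][s]: immediate reward

def make_model(p):
    """A candidate model: transition T[a][s][s'] with noise level p,
    and a fixed observation function Z[a][s'][o]."""
    T = {a: [[1 - p, p], [p, 1 - p]] for a in A}
    Z = {a: [[0.85, 0.15], [0.15, 0.85]] for a in A}
    return T, Z

# Finite uncertainty set of candidate (transition, observation) models.
MODELS = [make_model(0.05), make_model(0.25)]

def backup(b, a, T, Z, alphas):
    """Standard point-based backup at belief b for action a under a
    fixed model; returns (value at b, new alpha vector)."""
    alpha_new = list(R[a])
    for o in O:
        # Back-project each alpha vector through (T, Z) for observation o,
        # then keep the projection that is best at this belief point.
        projected = [[sum(T[a][s][sp] * Z[a][sp][o] * al[sp] for sp in S)
                      for s in S] for al in alphas]
        g = max(projected, key=lambda g: sum(b[s] * g[s] for s in S))
        for s in S:
            alpha_new[s] += GAMMA * g[s]
    return sum(b[s] * alpha_new[s] for s in S), alpha_new

def robust_backup(b, alphas):
    """Worst case over the model set, best case over actions."""
    best = None
    for a in A:
        v, alpha = min((backup(b, a, T, Z, alphas) for T, Z in MODELS),
                       key=lambda x: x[0])
        if best is None or v > best[0]:
            best = (v, alpha, a)
    return best

# Point-based value iteration over a fixed grid of belief points.
beliefs = [[x, 1 - x] for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
alphas = [[0.0, 0.0]]
for _ in range(100):
    alphas = [robust_backup(b, alphas)[1] for b in beliefs]

def V(b):
    """Value-function estimate: max over alpha vectors, hence piecewise
    linear and convex in the belief, matching the convexity result."""
    return max(sum(b[s] * al[s] for s in S) for al in alphas)
```

Because the value function is represented as a maximum over linear alpha vectors, the resulting estimate is convex in the belief state by construction, mirroring the convexity property the abstract establishes for the exact robust value function.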
