Solving Hidden-Semi-Markov-Mode Markov Decision Problems

Hidden-Mode Markov Decision Processes (HM-MDPs) were proposed to represent sequential decision-making problems in non-stationary environments that evolve according to a Markov chain. In this paper, we introduce Hidden-Semi-Markov-Mode Markov Decision Processes (HS3MDPs), a generalization of HM-MDPs to the more realistic case of non-stationary environments evolving according to a semi-Markov chain. Like HM-MDPs, HS3MDPs form a subclass of Partially Observable Markov Decision Processes. Large instances of HS3MDPs and HM-MDPs can therefore be solved with an online algorithm, Partially Observable Monte Carlo Planning (POMCP), which is based on Monte Carlo Tree Search and uses particle filters to approximate belief states. We propose a first adaptation of POMCP that solves HS3MDPs more efficiently by exploiting their structure. Our empirical results show that this adapted POMCP reaches higher cumulative rewards than the original algorithm. In larger instances, however, POMCP may run out of particles. To address this issue, we propose a second adaptation of POMCP that replaces particle filters with exact representations of beliefs. Our empirical results indicate that this new version reaches high cumulative rewards faster than the first adapted POMCP and remains efficient even for large problems.
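
To make the second adaptation concrete, the sketch below shows what an exact belief update over (mode, remaining duration) pairs could look like in Python. It is a minimal illustration, not the paper's implementation: the array names T, C, D, the duration convention (h = 0 meaning the mode switches after the current transition), and the assumption that the state transition is generated by the current mode are all illustrative choices.

```python
import numpy as np

def update_belief(belief, s, a, s_next, T, C, D):
    """Exact HS3MDP-style belief update over (mode, remaining duration)
    pairs after taking action `a` in state `s` and observing `s_next`.

    Illustrative conventions (assumptions, not the paper's notation):
    belief : (n_modes, max_h) array; belief[m, h] = P(mode m, h steps left).
             h = 0 means the mode switches after the current transition.
    T      : (n_modes, n_states, n_actions, n_states) array;
             T[m, s, a, s2] = P(s2 | s, a, mode m).
    C      : (n_modes, n_modes) array; C[m, m2] = P(next mode m2 | mode m).
    D      : (n_modes, n_modes, max_h) array;
             D[m, m2, h] = P(new remaining duration h | switch m -> m2).
    """
    new_belief = np.zeros_like(belief)
    n_modes = belief.shape[0]
    for m in range(n_modes):
        # Weight each hypothesis by the likelihood of the observed
        # state transition under mode m.
        w = T[m, s, a, s_next] * belief[m]
        # Duration still running: stay in mode m, one fewer step left.
        new_belief[m, :-1] += w[1:]
        # Duration exhausted (h = 0): the mode switches, and a new
        # remaining duration is drawn for the successor mode.
        for m2 in range(n_modes):
            new_belief[m2] += w[0] * C[m, m2] * D[m, m2]
    total = new_belief.sum()
    return new_belief / total if total > 0 else new_belief
```

Under these assumptions, the belief never degenerates the way a depleted particle set can; the trade-off is that every (mode, duration) pair is updated at each step, which is viable because the number of modes and maximal duration are typically small compared to the full state space.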
