On-Line Search for Solving Markov Decision Processes via Heuristic Sampling

Markov Decision Processes (MDPs) have become a standard framework for solving problems of sequential decision making under uncertainty. The usual objective in this framework is to compute an optimal policy, which prescribes an optimal action for every state of the system. For complex MDPs, exact computation of optimal policies is often intractable, and several approaches based on function approximation and simulation have been developed to compute near-optimal policies. In this paper, we investigate the problem of refining near-optimal policies via online search techniques, tackling the local problem of finding an optimal action for a single current state of the system. More precisely, we consider an online approach based on sampling: at each decision step, a randomly sampled look-ahead tree is developed to compute the optimal action for the current state. We propose a search strategy for constructing such trees whose purpose is to provide a good "anytime" profile: it first quickly selects a good action with high probability, and then smoothly increases the probability of selecting an optimal action as more samples are drawn.
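The paper's contribution is a specific strategy for growing the sampled look-ahead tree; as a rough illustration of the general idea (in the spirit of sparse sampling), here is a minimal Python sketch of online action selection from a generative model. The `model.actions`/`model.sample` interface and the `depth`, `width`, and `gamma` parameters are illustrative assumptions, not the authors' algorithm or API.

```python
def sampled_lookahead_value(model, state, depth, width, gamma):
    """Estimate the value of `state` with a randomly sampled look-ahead tree.

    `model.actions(state)` lists the available actions; `model.sample(state, a)`
    draws a (next_state, reward) pair from a generative model. Both are assumed
    interfaces for this sketch, not the paper's API.
    """
    if depth == 0:
        return 0.0  # a heuristic or near-optimal value estimate could be used here
    best = float("-inf")
    for a in model.actions(state):
        total = 0.0
        for _ in range(width):  # sample `width` successors per action
            next_state, reward = model.sample(state, a)
            total += reward + gamma * sampled_lookahead_value(
                model, next_state, depth - 1, width, gamma)
        best = max(best, total / width)
    return best


def select_action(model, state, depth=3, width=5, gamma=0.95):
    """Pick the action maximizing the sampled look-ahead value estimate."""
    def q_estimate(a):
        total = 0.0
        for _ in range(width):
            next_state, reward = model.sample(state, a)
            total += reward + gamma * sampled_lookahead_value(
                model, next_state, depth - 1, width, gamma)
        return total / width
    return max(model.actions(state), key=q_estimate)
```

An anytime variant would repeatedly refine this estimate, for instance by increasing `width` or `depth` while a time budget remains; the search strategy proposed in the paper is aimed at improving exactly this kind of anytime profile.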
