Paper Information - Optimized Look-ahead Tree Search Policies

Optimized Look-ahead Tree Search Policies

In this paper, we consider look-ahead tree techniques for the discrete-time control of a deterministic dynamical system so as to maximize a sum of discounted rewards over an infinite time horizon. Given the current system state x_t at time t, these techniques explore the look-ahead tree representing possible evolutions of the system states and rewards conditioned on subsequent actions u_t, u_{t+1}, …. When the computing budget is exhausted, they output the action u_t that led to the best sequence of discounted rewards found. In this context, we are interested in computing good strategies for exploring the look-ahead tree. We propose a generic approach that looks for such strategies by solving an optimization problem whose objective is to compute a (budget-compliant) tree-exploration strategy yielding a control policy that maximizes the average return over a postulated set of initial states. This generic approach is fully specified for the case where the candidate tree-exploration strategies are "best-first" strategies parameterized by a linear combination of look-ahead path features (some of which have been advocated in the literature before) and where the optimization problem is solved with an estimation of distribution algorithm (EDA) based on Gaussian distributions. Numerical experiments carried out on a model of the treatment of HIV infection show that the optimized tree-exploration strategy is orders of magnitude better than the previously advocated ones.
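The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The following are assumptions made for illustration only: the budget is counted in node expansions, the infinite-horizon return is truncated to a finite `horizon`, the EDA uses independent (diagonal) Gaussians with an elite-selection update, and the names `toy_step`, `toy_features`, and the three path features are hypothetical.

```python
import heapq
import itertools
import random


def lookahead_action(x0, actions, step, features, theta, budget, gamma=0.9):
    """Best-first look-ahead search: return the first action of the best
    discounted reward sequence found within `budget` node expansions."""
    tie = itertools.count()  # unique tie-breaker so states are never compared
    # Heap entry: (-score, tie, state, depth, discounted return, first action)
    frontier = [(0.0, next(tie), x0, 0, 0.0, None)]
    best_return, best_action = float("-inf"), actions[0]
    for _ in range(budget):
        if not frontier:
            break
        _, _, x, depth, ret, first = heapq.heappop(frontier)
        for u in actions:
            x_next, r = step(x, u)
            child_ret = ret + gamma ** depth * r
            child_first = u if first is None else first
            if child_ret > best_return:
                best_return, best_action = child_ret, child_first
            # Score of the new leaf: linear combination of path features.
            phi = features(depth + 1, child_ret, r)
            score = sum(w * f for w, f in zip(theta, phi))
            heapq.heappush(frontier, (-score, next(tie), x_next,
                                      depth + 1, child_ret, child_first))
    return best_action


def average_return(theta, initial_states, actions, step, features,
                   budget, gamma=0.9, horizon=20):
    """Average truncated discounted return of the look-ahead policy."""
    total = 0.0
    for x in initial_states:
        for t in range(horizon):
            u = lookahead_action(x, actions, step, features, theta, budget, gamma)
            x, r = step(x, u)
            total += gamma ** t * r
    return total / len(initial_states)


def gaussian_eda(objective, dim, iterations=10, population=20, elite=5, seed=0):
    """Maximize `objective` by refitting a diagonal Gaussian to elite samples."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    best_theta, best_score = None, float("-inf")
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        scored = sorted(((objective(th), th) for th in samples),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_theta = scored[0]
        top = [th for _, th in scored[:elite]]
        cols = list(zip(*top))
        mean = [sum(c) / elite for c in cols]
        # Floor the standard deviation so sampling never collapses to a point.
        std = [max(1e-3, (sum((v - m) ** 2 for v in c) / elite) ** 0.5)
               for c, m in zip(cols, mean)]
    return best_theta, best_score


def toy_step(x, u):
    """Hypothetical deterministic system: integer line, reward in [0, 1]."""
    x_next = x + u
    return x_next, min(max(x_next, 0), 10) / 10.0


def toy_features(depth, discounted_return, last_reward):
    """Hypothetical look-ahead path features (not those of the paper)."""
    return (discounted_return, float(depth), last_reward)
```

Under these assumptions, a strategy could be optimized by passing an objective such as `lambda th: average_return(th, [0, 2, 5], (-1, 1), toy_step, toy_features, budget=5)` to `gaussian_eda` with `dim=3`. The returned weight vector then defines the tree-exploration strategy used by `lookahead_action` at control time.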
