Optimized Look-ahead Tree Search Policies

In this paper, we consider look-ahead tree techniques for the discrete-time control of a deterministic dynamical system so as to maximize a sum of discounted rewards over an infinite time horizon. Given the current system state x_t at time t, these techniques explore the look-ahead tree representing possible evolutions of the system states and rewards conditioned on subsequent actions u_t, u_{t+1}, …. When the computing budget is exhausted, they output the action u_t that led to the best sequence of discounted rewards found. In this context, we are interested in computing good strategies for exploring the look-ahead tree. We propose a generic approach that looks for such strategies by solving an optimization problem whose objective is to compute a (budget-compliant) tree-exploration strategy yielding a control policy that maximizes the average return over a postulated set of initial states. This generic approach is fully specified for the case where the candidate tree-exploration strategies are "best-first" strategies parameterized by a linear combination of look-ahead path features (some of which have been advocated in the literature before) and where the optimization problem is solved using an estimation of distribution algorithm (EDA) based on Gaussian distributions. Numerical experiments carried out on a model of the treatment of HIV infection show that the optimized tree-exploration strategy is orders of magnitude better than the previously advocated ones.
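To make the procedure described above concrete, the following is a minimal sketch of a budget-limited best-first look-ahead search in which the next node to expand is chosen by a linear score over path features. It is an illustration under stated assumptions, not the authors' implementation: the function and parameter names (`lookahead_action`, `step`, `features`, `theta`), the choice to count the budget in node expansions, the two example features (depth and discounted return so far), and the toy system are all hypothetical.

```python
import heapq
import itertools


def lookahead_action(x0, actions, step, features, theta, budget, gamma):
    """Return the first action of the best discounted-reward path found.

    Assumptions (not from the paper): the budget is the number of node
    expansions, and the expansion priority of a node is the dot product of
    `theta` with the feature vector returned by `features`.
    """
    tie = itertools.count()  # unique tie-breaker so heap tuples never compare states
    # Heap entries: (-priority, tie, state, depth, discounted return, first action)
    open_nodes = [(0.0, next(tie), x0, 0, 0.0, None)]
    best_return, best_action = float("-inf"), actions[0]
    expansions = 0

    while open_nodes and expansions < budget:
        _, _, x, depth, ret, first = heapq.heappop(open_nodes)
        expansions += 1
        for u in actions:
            x_next, r = step(x, u)                 # deterministic model call
            child_ret = ret + (gamma ** depth) * r  # discounted sum along the path
            child_first = u if first is None else first
            if child_ret > best_return:
                best_return, best_action = child_ret, child_first
            phi = features(x_next, depth + 1, child_ret)
            priority = sum(w * f for w, f in zip(theta, phi))
            heapq.heappush(
                open_nodes,
                (-priority, next(tie), x_next, depth + 1, child_ret, child_first),
            )
    return best_action


# Hypothetical toy system: walk on the integers, reward 1 for landing on state 2.
def toy_step(x, u):
    x_next = x + u
    return x_next, (1.0 if x_next == 2 else 0.0)


# Hypothetical path features: (depth, discounted return so far).
def toy_features(x, depth, ret):
    return (float(depth), ret)


if __name__ == "__main__":
    # theta = (-1, 0) prefers shallow nodes, i.e. a breadth-first-like exploration.
    u = lookahead_action(0, (-1, 1), toy_step, toy_features, (-1.0, 0.0), 3, 0.9)
    print("chosen action:", u)
```

In this sketch, the weight vector `theta` is the quantity that an outer optimizer would tune: each candidate `theta` defines one exploration strategy, and its quality could be estimated by running the resulting control policy from a set of initial states and averaging the returns.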
