Cross-Entropy Optimization of Control Policies With Adaptive Basis Functions

This paper introduces an algorithm for the direct search of control policies in continuous-state, discrete-action Markov decision processes. The algorithm searches for the best closed-loop policy that can be represented using a given number of basis functions (BFs), where a discrete action is assigned to each BF. The number and type of the BFs are specified in advance and determine the complexity of the representation. Considerable flexibility is achieved by optimizing the locations and shapes of the BFs together with the action assignments. The optimization is carried out with the cross-entropy method, and the policies are evaluated by their empirical return from a representative set of initial states; the return for each representative state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy search with adaptive BFs is extensively evaluated on problems with two to six state variables, for which it reliably obtains good policies with only a small number of BFs. In these experiments, cross-entropy policy search requires far fewer BFs than value-function techniques with equidistant BFs, and it outperforms policy search with a competing optimization algorithm called DIRECT.
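
As a rough illustration of the loop described above, the sketch below implements cross-entropy policy search over the parameters of Gaussian radial BFs (centers and widths) and their discrete action assignments, scoring each candidate policy by its mean Monte Carlo return from a set of representative initial states. The environment interface (`step`, `reward`), the winner-take-all action selection, the problem sizes, and the smoothing constants are assumptions made for this example and are not taken from the paper.

```python
# A minimal sketch of cross-entropy policy search with adaptive basis functions,
# assuming Gaussian radial BFs, winner-take-all action assignment, and a
# hypothetical environment given by step(state, action) and reward(state, action).
# These names and the problem sizes below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_BFS, N_ACTIONS = 2, 5, 3   # illustrative sizes
HORIZON, GAMMA = 50, 0.95               # Monte Carlo rollout length and discount


def policy_action(state, centers, widths, actions):
    """Return the discrete action assigned to the BF with the largest activation."""
    activation = np.exp(-np.sum(((state - centers) / widths[:, None]) ** 2, axis=1))
    return actions[np.argmax(activation)]


def mc_return(centers, widths, actions, x0, step, reward):
    """Estimate the discounted return of one rollout started from x0."""
    x, ret, disc = np.array(x0, dtype=float), 0.0, 1.0
    for _ in range(HORIZON):
        u = policy_action(x, centers, widths, actions)
        ret += disc * reward(x, u)
        x = step(x, u)
        disc *= GAMMA
    return ret


def ce_policy_search(representative_states, step, reward,
                     n_iters=30, n_samples=100, elite_frac=0.1):
    """Cross-entropy optimization of BF centers, widths, and action assignments."""
    # Gaussians over the continuous parameters, categoricals over the actions.
    mu_c, sig_c = np.zeros((N_BFS, STATE_DIM)), np.ones((N_BFS, STATE_DIM))
    mu_w, sig_w = np.ones(N_BFS), 0.5 * np.ones(N_BFS)
    p_a = np.full((N_BFS, N_ACTIONS), 1.0 / N_ACTIONS)
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, -np.inf

    for _ in range(n_iters):
        params, scores = [], []
        for _ in range(n_samples):
            c = rng.normal(mu_c, sig_c)                       # BF centers
            w = np.abs(rng.normal(mu_w, sig_w)) + 1e-3        # BF widths (kept positive)
            a = np.array([rng.choice(N_ACTIONS, p=p_a[i]) for i in range(N_BFS)])
            # Score = mean Monte Carlo return over the representative initial states.
            score = np.mean([mc_return(c, w, a, x0, step, reward)
                             for x0 in representative_states])
            params.append((c, w, a))
            scores.append(score)

        elite_idx = np.argsort(scores)[-n_elite:]
        elites = [params[i] for i in elite_idx]
        if scores[elite_idx[-1]] > best_score:
            best, best_score = params[elite_idx[-1]], scores[elite_idx[-1]]

        # Refit the sampling distributions to the elite samples.
        mu_c = np.mean([e[0] for e in elites], axis=0)
        sig_c = np.std([e[0] for e in elites], axis=0) + 1e-3
        mu_w = np.mean([e[1] for e in elites], axis=0)
        sig_w = np.std([e[1] for e in elites], axis=0) + 1e-3
        for i in range(N_BFS):
            counts = np.bincount([int(e[2][i]) for e in elites], minlength=N_ACTIONS)
            p_a[i] = (counts + 1e-3) / (counts + 1e-3).sum()  # smoothed action probs

    return best, best_score
```

For a concrete problem one would supply the system's `step` and `reward` functions and a grid of representative initial states, e.g. `ce_policy_search(initial_state_grid, step, reward)`. In practice the cross-entropy updates are often additionally smoothed between iterations to avoid premature convergence of the sampling distributions.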
