A Framework for Aggregation of Multiple Reinforcement Learning Algorithms

Aggregation of multiple Reinforcement Learning (RL) algorithms is a new and effective technique for improving the quality of Sequential Decision Making (SDM). SDM is common and important in many realistic applications, especially automatic control problems. The quality of an SDM policy depends on (discounted) long-term rewards rather than instant rewards, and this delayed feedback makes SDM tasks much harder to handle than classification problems. Moreover, in many SDM tasks the feedback about a decision is evaluative rather than instructive, so supervised learning techniques are not suitable. RL methods are investigated to tackle these difficulties. Although many RL algorithms have been developed, none is consistently better than the others, and the parameters of an RL algorithm significantly influence its performance. Successful RL applications therefore depend on a suitable learning algorithm and carefully selected learning parameters, yet there is no universal rule to guide the choice of algorithm or the setting of parameters.

To handle this difficulty, a new multiple-RL system, the Aggregated Multiple Reinforcement Learning System (AMRLS), is developed. In the proposed system, each RL algorithm (learner) learns individually in its own learning module and provides its output to an intelligent aggregation module. The aggregation module dynamically aggregates these outputs and decides on an action; all learners then take that action and update their policies individually. The two processes alternate within each learning episode (a minimal sketch of this loop is given below). Because of the intelligent and dynamic aggregation, AMRLS can deal with dynamic learning problems without searching for the optimal learning algorithm or the optimal values of the learning parameters. It is claimed that several complementary learning algorithms can be integrated in AMRLS to improve learning performance in terms of success rate, robustness, confidence, redundancy, and complementarity.

There are two strategies for learning an optimal policy with RL methods. The first is Value Function Learning (VFL), which learns an optimal policy expressed as a value function; the Temporal Difference (TD) methods are examples of this strategy and are called TDRL in this dissertation. The second is Direct Policy Search (DPS), which searches for the optimal policy directly in the space of candidate policies; Genetic Algorithm (GA)-based search methods are instances of this strategy and are named GARL. Both strategies have advantages and disadvantages. A hybrid learning architecture of GARL and TDRL, HGATDRL, is proposed to combine them: HGATDRL first uses an off-line GARL stage to learn an initial policy and then updates that policy on-line with a TDRL approach (sketched in the second code example below). This new learning method enhances the learning ability of the RL learners in AMRLS.

The AMRLS framework and the HGATDRL method are tested on several SDM problems, including a maze-world problem, the pursuit domain, the cart-pole balancing system, the mountain-car problem, and a flight control system. The experimental results show that the proposed framework and method can enhance the learning ability and improve the learning performance of a multiple-RL system.
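To make the AMRLS loop concrete, the following is a minimal sketch in Python, assuming two tabular TD learners (Q-learning and SARSA) whose action values are combined by a weighted sum. The class and function names, the epsilon-greedy rule, and the env.reset()/env.step() interface are illustrative assumptions, not the dissertation's exact implementation.

    import numpy as np

    class TabularLearner:
        """One AMRLS learning module: a tabular TD learner (Q-learning or SARSA)."""

        def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, sarsa=False):
            self.Q = np.zeros((n_states, n_actions))
            self.alpha, self.gamma, self.sarsa = alpha, gamma, sarsa

        def preferences(self, s):
            # Output handed to the aggregation module: action values for state s.
            return self.Q[s]

        def update(self, s, a, r, s_next, a_next):
            # SARSA bootstraps on the action actually taken; Q-learning on the max.
            target = self.Q[s_next, a_next] if self.sarsa else self.Q[s_next].max()
            self.Q[s, a] += self.alpha * (r + self.gamma * target - self.Q[s, a])

    def aggregate(learners, s, weights, epsilon=0.1):
        """Aggregation module: weighted sum of the learners' action preferences."""
        combined = sum(w * ln.preferences(s) for w, ln in zip(weights, learners))
        if np.random.rand() < epsilon:           # keep some exploration
            return int(np.random.randint(combined.size))
        return int(np.argmax(combined))

    def run_episode(env, learners, weights):
        """One AMRLS episode: aggregate, act, then all learners update, repeated."""
        s, done = env.reset(), False
        a = aggregate(learners, s, weights)
        while not done:
            s_next, r, done = env.step(a)        # assumed (state, reward, done) interface
            a_next = aggregate(learners, s_next, weights)
            for ln in learners:                  # every learner updates on the shared action
                ln.update(s, a, r, s_next, a_next)
            s, a = s_next, a_next

The key property of the loop is that the action is chosen once, by the aggregation module, and every learner updates from that shared experience; SARSA-style learners bootstrap on the aggregated next action, so their updates stay on-policy with respect to the aggregated behavior.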
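Similarly, the following is a minimal sketch of the HGATDRL idea, assuming the policy is encoded as the weight vector of a linear Q-function: a GA evolves an initial weight vector off-line against an episodic-return fitness, and a semi-gradient TD stage then refines the same weights on-line. The GA operators (truncation selection, uniform crossover, Gaussian mutation), the fitness function `evaluate`, and the environment interface are illustrative stand-ins, not the dissertation's exact choices.

    import numpy as np

    def ga_initial_policy(evaluate, dim, pop_size=30, generations=50,
                          mut_rate=0.1, rng=np.random.default_rng(0)):
        """Off-line GARL stage: evolve a weight vector maximizing episodic return."""
        pop = rng.normal(size=(pop_size, dim))
        for _ in range(generations):
            fitness = np.array([evaluate(w) for w in pop])
            order = np.argsort(fitness)[::-1]
            parents = pop[order[: pop_size // 2]]               # truncation selection
            children = []
            for _ in range(pop_size - len(parents)):
                p1, p2 = parents[rng.integers(len(parents), size=2)]
                child = np.where(rng.random(dim) < 0.5, p1, p2) # uniform crossover
                child += mut_rate * rng.normal(size=dim)        # Gaussian mutation
                children.append(child)
            pop = np.vstack([parents, children])
        fitness = np.array([evaluate(w) for w in pop])
        return pop[int(np.argmax(fitness))]

    def td_refine(w, features, env, episodes=100, alpha=0.01, gamma=0.99,
                  epsilon=0.1, rng=np.random.default_rng(1)):
        """On-line TDRL stage: semi-gradient Q-learning update of the GA's weights."""
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if rng.random() < epsilon:       # epsilon-greedy exploration
                    a = int(rng.integers(env.n_actions))
                else:
                    a = int(np.argmax([w @ features(s, b)
                                       for b in range(env.n_actions)]))
                s_next, r, done = env.step(a)
                q_next = 0.0 if done else max(w @ features(s_next, b)
                                              for b in range(env.n_actions))
                delta = r + gamma * q_next - w @ features(s, a)   # TD error
                w = w + alpha * delta * features(s, a)            # semi-gradient step
        return w

The design point the sketch illustrates is that both stages operate on the same parameterization: the GA supplies a good starting point without needing value-function gradients, and the TD stage then exploits on-line feedback that the off-line search cannot see.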
