Large-scale dynamic optimization using teams of reinforcement learning agents

Recent algorithmic and theoretical advances in reinforcement learning (RL) are attracting widespread interest. RL algorithms have appeared that approximate dynamic programming (DP) on an incremental basis. Unlike traditional DP algorithms, they do not require knowledge of a system's state transition probabilities or reward structure. They can therefore be trained on real or simulated experience, which focuses their computation on the regions of state space actually visited during control and makes them computationally tractable on very large problems. RL algorithms can also serve as components of multi-agent algorithms: if each member of a team of agents employs one of these algorithms, a new collective learning algorithm emerges for the team as a whole. In this dissertation we demonstrate that such collective RL algorithms can be powerful heuristic methods for addressing large-scale control problems. Elevator group control serves as our primary testbed. The elevator domain poses a combination of challenges not seen in most RL research to date. Elevator systems operate in continuous state spaces and in continuous time as discrete-event dynamic systems. Their states are not fully observable, and they are non-stationary because passenger arrival rates change. To streamline the search through policy space, we use a team of RL agents, each responsible for controlling one elevator car. The team receives a global reinforcement signal that appears noisy to each agent because of the actions of the other agents, the random nature of the arrivals, and the incomplete observation of the state. In spite of these complications, we show results that, in simulation, surpass the best heuristic elevator control algorithms of which we are aware. These results demonstrate the power of RL on a very large-scale stochastic dynamic optimization problem of practical utility.
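
The team structure described above, in which each agent learns on its own while all agents receive the same global reinforcement signal, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the two-agent, two-action toy task, the reward function `global_reward`, the constants `ALPHA` and `EPSILON`, and the use of a one-state tabular value update in place of whatever learner and elevator simulator the work actually uses. From each agent's point of view the shared reward is noisy, because it also depends on the other agent's exploratory actions.

```python
import random

N_ACTIONS = 2    # hypothetical: an action is the index of the floor a car serves
ALPHA = 0.1      # learning rate (assumed value)
EPSILON = 0.1    # exploration probability (assumed value)


class Agent:
    """One independent learner; it sees only its own action and the team reward."""

    def __init__(self, rng):
        self.q = [0.0] * N_ACTIONS   # one value estimate per action (single state)
        self.rng = rng

    def greedy(self):
        # Action with the highest current estimate (ties go to the lowest index).
        return max(range(N_ACTIONS), key=lambda a: self.q[a])

    def act(self):
        # Epsilon-greedy action selection.
        if self.rng.random() < EPSILON:
            return self.rng.randrange(N_ACTIONS)
        return self.greedy()

    def learn(self, action, reward):
        # Incremental update of the chosen action's estimate toward the reward.
        self.q[action] += ALPHA * (reward - self.q[action])


def global_reward(actions):
    # Toy team objective: 1 when the cars cover different floors, else 0.
    return 1.0 if len(set(actions)) == len(actions) else 0.0


def train(steps=5000, seed=0):
    rng = random.Random(seed)
    team = [Agent(rng), Agent(rng)]
    rewards = []
    for _ in range(steps):
        actions = [agent.act() for agent in team]
        r = global_reward(actions)           # one signal shared by the whole team
        for agent, a in zip(team, actions):
            agent.learn(a, r)                # each agent updates independently
        rewards.append(r)
    return team, rewards


team, rewards = train()
recent = sum(rewards[-1000:]) / 1000
print([agent.greedy() for agent in team], round(recent, 2))
```

In this toy setting no agent is told what the other does, yet the pair typically settles on complementary actions, which is the sense in which a collective learning algorithm emerges from the individual learners. The sketch says nothing about how such a scheme scales to continuous time, partial observability, or non-stationary arrivals.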
