ADAPTIVE RESOURCE CONTROL: Machine Learning Approaches to Resource Allocation in Uncertain and Changing Environments

The dissertation studies resource allocation problems (RAPs) in uncertain and changing environments. Chapter 1 gives a brief introduction to the motivations behind RAPs and their classical formulations, followed by a section on Markov decision processes (MDPs), which constitute the basis of the approach. The core of the thesis consists of two parts: the first deals with uncertainties, namely with stochastic RAPs, while the second studies the effects of changes in the environmental dynamics on learning algorithms.

Chapter 2, the first core part, investigates stochastic RAPs with scarce, reusable resources and non-preemptive, interconnected tasks having temporal extensions. These RAPs are natural generalizations of several standard resource management problems, such as scheduling and transportation problems. First, reactive solutions are considered and defined as policies of suitably reformulated MDPs. It is highlighted that this reformulation has several favorable properties: it has finite state and action spaces, it is acyclic, hence all policies are proper, and the space of policies can be safely restricted. Proactive solutions are also proposed and defined as policies of special partially observable MDPs. Next, reinforcement learning (RL) methods, such as fitted Q-learning, are suggested for computing a policy. In order to maintain the value function compactly, two representations are studied: hash tables and support vector regression (SVR), in particular ν-SVR. Several additional improvements are investigated as well, such as the application of rollout algorithms in the initial phases, action space decomposition, task clustering and distributed sampling.

Chapter 3, the second core part, studies the possibility of applying value-function-based RL methods in cases where the environment may change over time. First, theorems are presented which show that, in a discounted MDP, the optimal value function and the value function of a fixed control policy depend Lipschitz continuously on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the transition-probability function and the immediate-cost function may vary over time, but the changes must be asymptotically bounded. Then, learning in changing environments is investigated: a general relaxed convergence theorem for stochastic iterative algorithms is presented and illustrated through three classical examples: value iteration, Q-learning and TD-learning.

Finally, in Chapter 4, results of numerical experiments on both benchmark and industry-related problems are presented. They demonstrate the effectiveness of the proposed adaptive resource allocation approach, as well as learning in the presence of disturbances and changes.
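To make the reactive approach of Chapter 2 more concrete, the following is a minimal fitted Q-iteration sketch in which the Q-function is held by a ν-SVR regressor. The toy MDP, the one-hot features and the hyper-parameters (for example NuSVR(nu=0.5, C=10.0)) are illustrative assumptions for this sketch only, not the task model or the representation used in the dissertation.

import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 20, 4, 0.95

# Toy MDP: random transition kernel P[s, a] and immediate costs g[s, a].
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
g = rng.uniform(0.0, 1.0, size=(N_STATES, N_ACTIONS))

def sample_transitions(n):
    # Sample (state, action, cost, next state) tuples under uniform exploration.
    s = rng.integers(N_STATES, size=n)
    a = rng.integers(N_ACTIONS, size=n)
    s_next = np.array([rng.choice(N_STATES, p=P[si, ai]) for si, ai in zip(s, a)])
    return s, a, g[s, a], s_next

def features(s, a):
    # One-hot (state, action) features; a real RAP would use richer, problem-specific features.
    phi = np.zeros((len(s), N_STATES + N_ACTIONS))
    phi[np.arange(len(s)), s] = 1.0
    phi[np.arange(len(s)), N_STATES + a] = 1.0
    return phi

s, a, cost, s_next = sample_transitions(2000)
model = None
for _ in range(15):
    # Bellman targets: immediate cost plus discounted minimum over next actions.
    if model is None:
        q_next = np.zeros((len(s), N_ACTIONS))
    else:
        q_next = np.column_stack([
            model.predict(features(s_next, np.full(len(s), b)))
            for b in range(N_ACTIONS)
        ])
    targets = cost + GAMMA * q_next.min(axis=1)
    # Fit a nu-SVR to the targets: the compact value-function representation.
    model = NuSVR(nu=0.5, C=10.0, kernel="rbf").fit(features(s, a), targets)

# Greedy (cost-minimizing) action in state 0 under the learned Q-function.
q0 = [model.predict(features(np.array([0]), np.array([b])))[0] for b in range(N_ACTIONS)]
print("greedy action in state 0:", int(np.argmin(q0)))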
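The robustness results of Chapter 3 can be illustrated by bounds of the following representative form; this is only a sketch, and the precise norms and constants are those stated in the dissertation. For a discounted MDP with discount factor $\gamma$ and two immediate-cost functions $g$ and $\hat g$ that are otherwise identical,

    \| V^* - \hat V^* \|_\infty \;\le\; \frac{1}{1-\gamma}\, \| g - \hat g \|_\infty .

An analogous bound, with a constant proportional to $\gamma \|g\|_\infty / (1-\gamma)^2$, relates the optimal value functions of two MDPs that differ only in their transition-probability functions. By contrast, the dependence on the discount factor itself behaves like $|\gamma_1 - \gamma_2| \, \|g\|_\infty / \big((1-\gamma_1)(1-\gamma_2)\big)$, which blows up as the discount factors approach one, so it is not Lipschitz on $[0, 1)$.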
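The relaxed convergence theorem of Chapter 3 concerns stochastic iterative algorithms. A common generic form of such iterations, given here only as a hedged sketch of the setting (the exact assumptions are in the dissertation), is

    V_{t+1}(x) \;=\; \big(1 - \alpha_t(x)\big)\, V_t(x) \;+\; \alpha_t(x)\, \big( (K_t V_t)(x) + w_t(x) \big),

where each $K_t$ is a contraction-type operator, $w_t$ is a noise term, and the learning rates $\alpha_t$ satisfy the usual Robbins-Monro conditions. Q-learning is recovered, for example, with the operator $(K_t Q)(x, a) = g_t(x, a) + \gamma \min_b Q(Y_{t+1}, b)$, where $Y_{t+1}$ is the observed successor state; in an (ε, δ)-MDP the cost $g_t$ and the sampled transitions may vary over time, provided the changes are asymptotically bounded.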
