Concurrent decision making in Markov decision processes

This dissertation investigates concurrent decision making and coordination in systems that can execute multiple actions simultaneously in order to perform tasks more efficiently. Concurrent decision making is a fundamental problem in robotics, control, and computer science, and it is recognized as a formidable challenge in artificial intelligence in particular. By concurrent decision making we refer to a class of problems in which agents must accomplish long-term goals by executing multiple activities at once. The problem is difficult in general because it requires learning and planning over a combinatorial set of interacting concurrent activities that have uncertain outcomes and compete for the system's limited resources.

The dissertation presents a general framework for modeling the concurrent decision making problem based on semi-Markov decision processes (SMDPs). Our approach adopts a centralized control formalism in which a central control mechanism initiates, executes, and monitors concurrent activities. This view also captures the type of concurrency that exists in single-agent domains, where a single agent performs multiple activities simultaneously by exploiting the degrees of freedom (DOF) in the system. We present a set of coordination mechanisms used by our model to monitor the execution and termination of concurrent activities; these mechanisms incorporate natural activity-completion schemes based on the individual termination of each activity. We provide theoretical results establishing the correctness of the model semantics, which allows standard SMDP learning and planning techniques to be applied to the concurrent decision making problem.

SMDP solution methods, however, do not scale to concurrent decision making systems with many degrees of freedom. This is a classic instance of the curse of dimensionality in the action space: the set of concurrent activities grows exponentially as the system admits more degrees of freedom. To alleviate this problem, we develop a novel decision-theoretic framework motivated by the coarticulation phenomenon investigated in speech and motor control research. The key idea is that in many concurrent decision making problems the overall objective can be viewed as the concurrent optimization of a set of interacting, and possibly simpler, subgoals for which the agent has already acquired the necessary skills. We show that applying coarticulation to systems with excess degrees of freedom naturally generates concurrency. We also present theoretical results characterizing the efficiency of concurrent decision making under the coarticulation framework compared with the case in which the agent is only allowed to execute activities sequentially (i.e., no coarticulation). (Abstract shortened by UMI.)
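To make the action-space blow-up and the termination-based coordination concrete, the sketch below illustrates the two ideas in isolation. It is a minimal illustration only: the `Option` class, the per-DOF activity names, and the "any"/"all" termination rules are assumptions chosen for exposition and are not the dissertation's exact formalism.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List

# A minimal stand-in for a temporally extended activity ("option"):
# beta(state) gives the probability that the activity terminates in `state`.
@dataclass
class Option:
    name: str
    beta: Callable[[object], float]

def joint_activity_set(per_dof_options: List[List[Option]]):
    """Enumerate all concurrent activity combinations, one option per DOF.
    The count is the product of the per-DOF choices, i.e. it grows
    exponentially with the number of degrees of freedom."""
    return list(product(*per_dof_options))

def any_terminated(joint, state) -> bool:
    """'Any'-style coordination (assumed rule): the joint decision epoch
    ends as soon as any component activity terminates."""
    return any(opt.beta(state) >= 1.0 for opt in joint)

def all_terminated(joint, state) -> bool:
    """'All'-style coordination (assumed rule): the joint decision epoch
    ends only when every component activity has terminated."""
    return all(opt.beta(state) >= 1.0 for opt in joint)

if __name__ == "__main__":
    # Three degrees of freedom, four primitive activities each:
    # 4 ** 3 = 64 joint activities, and the count doubles (or worse)
    # with every additional DOF -- the action-space curse of dimensionality.
    dofs = [[Option(f"dof{d}_act{a}", beta=lambda s, a=a: float(a == 0))
             for a in range(4)] for d in range(3)]
    joints = joint_activity_set(dofs)
    print(len(joints))                       # 64
    # joints[1] mixes finished (beta = 1) and still-running (beta = 0) parts:
    print(any_terminated(joints[1], None))   # True
    print(all_terminated(joints[1], None))   # False
```

The choice of termination rule matters: an "any"-style epoch lets the central controller re-decide more often (at the cost of interrupting activities), while an "all"-style epoch commits to every component until completion; both are natural completion schemes of the kind the coordination mechanisms above are meant to capture.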
