Utilising policy types for effective ad hoc coordination in multiagent systems

This thesis is concerned with the ad hoc coordination problem, in which the goal is to design an autonomous agent that can achieve high flexibility and efficiency in a multiagent system that admits no prior coordination between the designed agent and the other agents. Flexibility describes the agent’s ability to solve its task with a variety of other agents in the system; efficiency refers to the relation between the agent’s payoffs and the time needed to solve the task; and no prior coordination means that the agent does not know a priori how the other agents behave. This problem is relevant for a number of practical applications, including human-machine interaction tasks such as adaptive user interfaces, robotic elderly care, and automated trading agents.

Motivated by this problem, the central idea studied in this thesis is to utilise a set of policies, or types, to characterise the behaviour of other agents. Specifically, the idea is to reduce the complexity of the interaction problem by assuming that the other agents draw their latent type from a known or hypothesised space of types, and that the assignment of types is governed by an unknown distribution. Based on the current interaction history, we can form posterior beliefs about the relative likelihood of each type. These beliefs, combined with the types’ predictions of future actions, can then be used in a planning procedure to compute optimal responses. The aim of this thesis is to study the potential and limitations of this idea in the context of ad hoc coordination. We formulate the ad hoc coordination problem using a game-theoretic model called the stochastic Bayesian game and, based on this model, derive a canonical algorithmic description of the idea outlined above, called Harsanyi-Bellman Ad Hoc Coordination (HBA). The practical potential of HBA is demonstrated in two case studies: a human-machine experiment and a simulated logistics domain.

We formulate basic ways to incorporate evidence (i.e. observed actions) into posterior beliefs and analyse the conditions under which the posterior beliefs converge to the true distribution of types. Furthermore, we study the impact of prior beliefs over types (that is, beliefs held before any actions are observed) on the long-term performance of HBA, and show empirically that automatic methods can compute prior beliefs with consistent performance effects. For hypothesised (i.e. “guessed”) type spaces, we analyse the relations between the hypothesised and true type spaces under which HBA is still guaranteed to solve its task, despite inaccuracies in the hypothesised types. Finally, we show how HBA can perform an automatic statistical analysis to decide whether to reject its behavioural hypothesis, i.e. the combination of posterior beliefs and types.
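To make the central mechanism concrete, the following is a minimal Python sketch of type-based belief updating and best-response computation, under simplifying assumptions of our own: each type is represented as a function mapping an interaction history and state to a probability distribution over the other agent’s actions, and there is a single other agent. All identifiers are illustrative rather than the thesis’s actual implementation; in particular, the full HBA algorithm plans over future states (the “Bellman” component), whereas this sketch computes only a one-step best response.

```python
# Illustrative sketch only: `types` maps a type name to a policy, where a
# policy is a callable policy(history, state) -> {action: probability}.
# The history is a list of (state, action) pairs observed so far.

def posterior_over_types(prior, types, history):
    """Posterior beliefs P(type | history) via Bayes' rule: the prior is
    multiplied by the probability each type assigns to the observed actions."""
    weights = {}
    for name, policy in types.items():
        likelihood = prior[name]
        for t, (state, action) in enumerate(history):
            likelihood *= policy(history[:t], state).get(action, 0.0)
        weights[name] = likelihood
    total = sum(weights.values())
    if total == 0.0:
        # Every type rules out the observed actions; fall back to the prior
        # rather than dividing by zero.
        return dict(prior)
    return {name: weight / total for name, weight in weights.items()}

def best_response(our_actions, posterior, types, history, state, payoff):
    """One-step best response: choose our action with maximal expected payoff
    under the belief-weighted prediction of the other agent's next action."""
    def expected_payoff(our_action):
        value = 0.0
        for name, policy in types.items():
            for their_action, prob in policy(history, state).items():
                value += posterior[name] * prob * payoff(state, our_action, their_action)
        return value
    return max(our_actions, key=expected_payoff)
```

For example, with two hypothesised types such as “cooperate” and “defect” in a repeated matrix game, the posterior shifts towards whichever type better explains the opponent’s observed actions, and the best response adapts accordingly.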

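The final contribution, deciding whether to reject the behavioural hypothesis, can likewise be sketched. The procedure below is an assumption-laden illustration of one way such an automatic statistical test could work (a Monte Carlo tail test on a log-likelihood score), not the specific test derived in the thesis: it scores how well the hypothesis predicted the observed actions, and rejects if that score is improbably low under action sequences sampled from the hypothesis itself.

```python
import math
import random

# Illustrative only: same conventions as the previous sketch. The score
# function, resampling scheme, and significance level are our assumptions.

def hypothesis_prediction(posterior, types, history, state):
    """Belief-weighted distribution over the other agent's next action."""
    prediction = {}
    for name, policy in types.items():
        for action, prob in policy(history, state).items():
            prediction[action] = prediction.get(action, 0.0) + posterior[name] * prob
    return prediction

def log_score(posterior, types, history):
    """Log-likelihood of the observed action sequence under the hypothesis."""
    total = 0.0
    for t, (state, action) in enumerate(history):
        prediction = hypothesis_prediction(posterior, types, history[:t], state)
        total += math.log(max(prediction.get(action, 0.0), 1e-12))  # guard log(0)
    return total

def reject_hypothesis(posterior, types, history, n_samples=1000, alpha=0.05):
    """Monte Carlo test: reject the hypothesis if the observed score lies in
    the lower alpha-tail of scores from sequences the hypothesis generates."""
    observed = log_score(posterior, types, history)
    at_or_below = 0
    for _ in range(n_samples):
        sampled = []
        for state, _ in history:  # replay the observed states
            prediction = hypothesis_prediction(posterior, types, sampled, state)
            actions, probs = zip(*prediction.items())
            sampled.append((state, random.choices(actions, weights=probs)[0]))
        if log_score(posterior, types, sampled) <= observed:
            at_or_below += 1
    p_value = at_or_below / n_samples
    return p_value < alpha  # small p-value: observed fit is unusually poor
```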