We examined the behavior of reinforcement-learning algorithms in a set of two-player stochastic games played on a grid. These games were selected because they include both cooperative and competitive elements, highlighting the importance of adaptive collaboration between the players. We found that pairs of learners were surprisingly good at discovering stable, mutually beneficial behavior when such behavior existed. However, the performance of the learners was significantly impacted by their other-regarding preferences. We found similar patterns of results in games involving human–human and human–agent pairs.

Introduction

The field of reinforcement learning (Sutton and Barto 1998) is concerned with agents that improve their behavior in sequential environments through interaction. One of the best known and most versatile reinforcement-learning (RL) algorithms is Q-learning (Watkins and Dayan 1992), which is known to converge to optimal decisions in environments that can be characterized as Markov decision processes. Q-learning is best suited to single-agent environments; nevertheless, it has been applied in multi-agent environments (Sandholm and Crites 1995; Gomes and Kowalczyk 2009; Wunder, Littman, and Babes 2010), including non-zero-sum stochastic games, with varying degrees of success.

Nash-Q (Hu and Wellman 2003) is an attempt to adapt Q-learning to the general-sum setting, but its update rule is inefficient and it lacks meaningful convergence guarantees (Bowling 2000; Littman 2001). Correlated-Q (Greenwald and Hall 2003) improves on Nash-Q in that, in exchange for access to a correlating device, its update rule is computationally efficient. However, there exist environments in which correlated-Q also does not converge (Zinkevich, Greenwald, and Littman 2005). Minimax-Q (Littman 1994a) converges to provably optimal decisions, but only in zero-sum Markov games. Likewise, Friend-Q and Foe-Q (Littman 2001) provably converge, but only to optimal decisions in purely cooperative and purely competitive games, respectively.

One significant shortcoming of the aforementioned multi-agent learning algorithms is that they define their updates in a way that makes assumptions about their opponents without actually factoring in their opponents' observed behavior. In a sense, they are too stubborn. In contrast, single-agent learning algorithms like Q-learning are too flexible: they simply adapt to their opponents without considering how their own behavior will affect the opponent. What these existing algorithms lack is the ability to negotiate a mutually beneficial outcome (Gal et al. 2004). Some algorithms have been designed to seek a best response against a fixed player and a mutually beneficial response against like players (Conitzer and Sandholm 2007; Bowling and Veloso 2002). Others attempt to "lead" a learning opponent to beneficial behavior (Littman and Stone 2001).

In this work, we return to the investigation of the behavior of single-agent Q-learning in multi-agent environments. Ultimately, a major goal of developing machine agents that act intelligently in multi-agent scenarios is to apply them to real-world problems. In some current multi-agent environments (such as the stock market and online advertising auctions), machine agents already interact with humans.
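As background, the tabular Q-learning update discussed above can be sketched as follows. This is a generic, minimal sketch: the hyperparameter values and the epsilon-greedy exploration scheme are illustrative choices, not details taken from this paper.

```python
import random
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-learner with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = list(actions)   # e.g. ["north", "south", "east", "west", "wait"]
        self.alpha = alpha             # learning rate
        self.gamma = gamma             # discount factor
        self.epsilon = epsilon         # exploration rate
        self.q = defaultdict(float)    # Q[(state, action)] -> estimated value, default 0

    def choose(self, state):
        """Explore with probability epsilon; otherwise act greedily w.r.t. Q."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done):
        """One-step Q-learning backup: Q(s,a) += alpha * (target - Q(s,a))."""
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

In the multi-agent setting studied here, each agent runs an independent copy of such a learner and treats the other agent simply as part of its environment.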
Successfully expanding the scope of real-world applications of multi-agent learning requires studying how these agents interact with human agents. A machine agent that interacts optimally with other machine agents, but not with human agents, is unlikely to be effective in environments that include humans. Further, one major goal of developing machine agents is for them to solve tasks in collaboration with human agents. Given the controversial nature of rationality assumptions for human agents (Kahneman, Slovic, and Tversky 1982), a machine agent that plans its collaboration by assuming the human will act rationally (optimally) is unlikely to collaborate with the human successfully. Thus, in this paper, we investigate how human agents interact with each other, and how humans interact with fair and selfish reinforcement-learning agents.

Our work is inspired by results in behavioral game theory (Camerer 2003), where researchers have explored multi-agent decision making in cases where each agent maximizes a utility that combines its own objective utility and, to some lesser extent, other-regarding preferences that penalize inequity between agents. Our approach goes beyond earlier attempts to nudge agents toward more cooperative behavior (Babes, Munoz de Cote, and Littman 2008) and instead provides a general framework that considers both objective and subjective rewards (Singh et al. 2010) in the form of other-regarding preferences. We investigate the behavior of this approach in machine–machine and machine–human interactions. Our main contribution is an exploration of how incorporating others' preferences into agents' world views in multi-agent decision making improves individual performance during the learning phase, leads to desirable, robust policies that are both defensive and fair, and improves joint results when interacting with humans, without sacrificing individual performance.

Experimental Testbed

Our experimental testbed included several two-agent grid games. These games are designed to vary the level of coordination required, while at the same time allowing agents to defend against uncooperative partners. A grid game is a game played by two agents on a grid, in which each agent has a goal. See, for example, Figure 1, which shows a 3×5 grid in which the two agents' initial positions are one another's goals: Orange begins in position (1,2), Blue's goal; and Blue begins in position (5,2), Orange's goal. We refer to grid positions using x-y coordinates, with (1,1) as the bottom-left position.

One grid-game match proceeds in rounds, and each round consists of multiple turns. On each turn, the agents choose one of five actions (north, south, east, west, or wait), which are then executed simultaneously. In the most basic setup, agents transition deterministically, and there is no tie-breaking when two agents collide: if their chosen actions would result in a collision with one another, neither agent moves. (It is a simple matter to vary these rules within our infrastructure, as future experimental design might dictate.) A round ends when either (or both) agents move into their goals, or when a maximum number of turns has been taken.

As mentioned above, our grid games are specifically designed to prevent the agents from reaching their goals without coordinating their behavior. Consequently, one approach is for an agent to cooperate blindly with its opponent by simply moving out of the opponent's way and hoping the opponent then waits for the agent to catch up.
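The grid-game dynamics just described, together with the other-regarding preferences discussed above, can be sketched as follows. This is a minimal illustration under stated assumptions: the treatment of swap moves as collisions and the inequity-penalty weights are placeholders, not values taken from the paper.

```python
# Illustrative sketch of the grid-game dynamics and of an other-regarding
# (subjective) reward. The action set, simultaneous moves, collision rule, and
# round-termination conditions follow the text; everything else is an assumption.

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0), "wait": (0, 0)}

def _clip(pos, action, width=5, height=3):
    """Destination of a single agent's move, clipped to the grid
    (1-indexed, with (1,1) at the bottom left, as in the text)."""
    dx, dy = MOVES[action]
    return (min(max(pos[0] + dx, 1), width), min(max(pos[1] + dy, 1), height))

def step(pos_a, pos_b, act_a, act_b, width=5, height=3):
    """Execute one simultaneous turn. If the chosen moves collide (land on the
    same cell, or swap cells; the swap case is an assumption), neither agent moves."""
    new_a = _clip(pos_a, act_a, width, height)
    new_b = _clip(pos_b, act_b, width, height)
    if new_a == new_b or (new_a == pos_b and new_b == pos_a):
        return pos_a, pos_b
    return new_a, new_b

def round_over(pos_a, pos_b, goal_a, goal_b, turn, max_turns=20):
    """A round ends when either (or both) agents occupy their goal,
    or when the maximum number of turns has been taken."""
    return pos_a == goal_a or pos_b == goal_b or turn >= max_turns

def subjective_reward(r_self, r_other, alpha=0.5, beta=0.05):
    """One common way to formalize other-regarding preferences: the objective
    reward minus penalties for disadvantageous (alpha) and advantageous (beta)
    inequity. Both weights are hypothetical."""
    return r_self - alpha * max(r_other - r_self, 0.0) - beta * max(r_self - r_other, 0.0)
```

One natural reading of the fair/selfish distinction used in this paper is that a "fair" learner is trained on subjective_reward applied to the two agents' objective rewards, while a "selfish" learner corresponds to setting both inequity weights to zero.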
However, such strategies can be exploited by uncooperative ones that proceed directly to the goal as soon as their path is unobstructed. To distinguish "unsafe" from "safe" cooperation, we devised a new categorization for strategies in our grid games. Specifically, we call strategies that allow for cooperation, while at the same time maintaining a defensive position in case the other agent is uncooperative, cooperative defensive strategies. More formally, an agent's strategy is cooperative (C) if it allows both the agent and its opponent to reach their goals, while an agent's strategy is defensive (D) if its opponent has no counter-strategy that allows the opponent to reach its goal strictly first. A cooperative defensive (CD) strategy is both cooperative and defensive. (A brute-force check of these properties for small games is sketched at the end of this section.)

We now proceed to describe a sample set of grid games, and equilibria composed of CD strategies (when they exist), to illustrate the kinds of interactions we studied. Our first game, Hallway, is shown in Figure 1.

Figure 1: Hallway
Figure 2: Intersection
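For small grids and short rounds, the cooperative/defensive definitions above can be checked by brute force: fix one agent's deterministic strategy and enumerate the opponent's action sequences (which, because the dynamics and the fixed strategy are deterministic, covers every opponent strategy). The sketch below reuses the same illustrative dynamics as before; the default grid size, round length, and helper names such as is_cd are assumptions for illustration, not the evaluation procedure used in the paper.

```python
from itertools import product

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0), "wait": (0, 0)}
ACTIONS = list(MOVES)

def _step(pos_a, pos_b, act_a, act_b, width=5, height=3):
    """Simultaneous move with the no-tie-breaking collision rule: if the moves
    would land on the same cell (or swap cells), neither agent moves."""
    def clip(p, a):
        return (min(max(p[0] + MOVES[a][0], 1), width),
                min(max(p[1] + MOVES[a][1], 1), height))
    na, nb = clip(pos_a, act_a), clip(pos_b, act_b)
    return (pos_a, pos_b) if na == nb or (na, nb) == (pos_b, pos_a) else (na, nb)

def _outcomes(policy_a, start_a, start_b, goal_a, goal_b, max_turns=6):
    """Play agent A's fixed deterministic strategy (a function of the joint
    positions and the turn index) against every opponent action sequence.
    Yields (a_reached, b_reached, b_strictly_first) for each round."""
    for plan_b in product(ACTIONS, repeat=max_turns):
        pos_a, pos_b = start_a, start_b
        a_done = b_done = b_first = False
        for t in range(max_turns):
            pos_a, pos_b = _step(pos_a, pos_b, policy_a(pos_a, pos_b, t), plan_b[t])
            a_done, b_done = pos_a == goal_a, pos_b == goal_b
            if a_done or b_done:          # the round ends at the first goal arrival
                b_first = b_done and not a_done
                break
        yield a_done, b_done, b_first

def is_cooperative(policy_a, **game):
    """C: some opponent play lets both agents be at their goals when the round ends."""
    return any(a and b for a, b, _ in _outcomes(policy_a, **game))

def is_defensive(policy_a, **game):
    """D: no opponent play reaches B's goal strictly before A reaches its own."""
    return not any(first for _, _, first in _outcomes(policy_a, **game))

def is_cd(policy_a, **game):
    """CD: both cooperative and defensive."""
    return is_cooperative(policy_a, **game) and is_defensive(policy_a, **game)
```

For the Hallway game of Figure 1, with Orange as agent A, one would call, for example, is_cd(policy, start_a=(1,2), start_b=(5,2), goal_a=(5,2), goal_b=(1,2)).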
References

[1] Tversky, A., and Kahneman, D. 1974. Judgment under uncertainty: Heuristics and biases. Science.
[2] Littman, M. L. 1994. Memoryless policies: Theoretical limitations and practical results.
[3] Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In ICML.
[4] Sandholm, T. W., and Crites, R. H. 1996. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems.
[5] Bowling, M. 2000. Convergence problems of general-sum multiagent reinforcement learning. In ICML.
[6] Littman, M. L., and Stone, P. 2001. Implicit negotiation in repeated games. In ATAL.
[7] Littman, M. L. 2001. Friend-or-foe Q-learning in general-sum games. In ICML.
[8] Bowling, M., and Veloso, M. M. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence.
[9] Greenwald, A., and Hall, K. 2003. Correlated Q-learning. In ICML.
[10] Hu, J., and Wellman, M. P. 2003. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research.
[11] Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning.
[12] Megiddo, N., et al. 2004. Exploration-exploitation tradeoffs for experts algorithms in reactive environments. In NIPS.
[13] Gal, Y., et al. 2004. Learning social preferences in games. In AAAI.
[14] Zinkevich, M.; Greenwald, A.; and Littman, M. L. 2005. Cyclic equilibria in Markov games. In NIPS.
[15] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
[16] Conitzer, V., and Sandholm, T. 2007. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning.
[17] Babes, M.; Munoz de Cote, E.; and Littman, M. L. 2008. Social reward shaping in the prisoner's dilemma. In AAMAS.
[18] Gomes, E. R., and Kowalczyk, R. 2009. Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In ICML.
[19] Singh, S.; Lewis, R. L.; and Barto, A. G. 2009. Where do rewards come from?
[20] Wunder, M.; Littman, M. L.; and Babes, M. 2010. Classes of multiagent Q-learning dynamics with ε-greedy exploration. In ICML.
[21] Singh, S.; Lewis, R. L.; Barto, A. G.; and Sorg, J. 2010. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development.
[22] Littman, M. L., et al. 2013. Coco-Q: Learning in stochastic games with side payments. In ICML.
[23] Camerer, C. F. 2003. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press.
[24] Crandall, J. W. 2014. Non-myopic learning in repeated stochastic games. arXiv preprint.