Online planning for optimal protector strategies in resource conservation games

Protecting our environment and natural resources is a major global challenge. "Protectors" (law enforcement agencies) try to protect these natural resources, while "extractors" (criminals) seek to exploit them. In many domains, such as illegal fishing, the extractors know more about the distribution and richness of the resources than the protectors, making it extremely difficult for the protectors to optimally allocate their assets for patrol and interdiction. Fortunately, extractors carry out frequent illegal extractions, so protectors can learn about the richness of the resources by observing the extractors' behavior. This paper presents an approach for allocating protector assets based on learning from extractors. We make the following four contributions: (i) we model resource conservation as a repeated game; (ii) we transform this repeated game into a POMDP by adopting a fixed model of the adversary's behavior; the resulting POMDP has an exponential state space and cannot be solved by the latest general-purpose POMDP solvers; (iii) in response, we propose GMOP, a dedicated algorithm that combines Gibbs sampling with Monte Carlo tree search for online planning in this POMDP; (iv) for a specific class of our game, we show how to speed up GMOP without sacrificing solution quality, and we provide a heuristic that trades off solution quality for lower computational cost.
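To make the Gibbs-sampling-plus-MCTS combination concrete, the sketch below shows one generic way such a planner could be structured: states are drawn from the current belief by Gibbs sampling, and each sampled state seeds a UCT simulation, so the search value at the root approximates an expectation over the belief. This is a minimal illustrative sketch, not the authors' GMOP implementation; the names (GibbsBeliefSampler, UCTPlanner, conditional_sampler, simulate_step) and the simple resource-site state representation are assumptions made for illustration.

```python
# Hypothetical sketch of Gibbs-sampling-driven online MCTS planning for a
# large-state POMDP. Not the paper's GMOP algorithm; names and interfaces
# are assumptions for illustration only.
import math
import random
from collections import defaultdict


class GibbsBeliefSampler:
    """Draws full states from the current belief by resampling one state
    component at a time, conditioned on the others and on the history of
    observed extractions (standard Gibbs sweep)."""

    def __init__(self, num_sites, conditional_sampler, burn_in=50):
        self.num_sites = num_sites
        self.conditional_sampler = conditional_sampler  # p(site_i | rest, history)
        self.burn_in = burn_in

    def sample_state(self, history):
        state = [random.random() for _ in range(self.num_sites)]  # arbitrary start
        for _ in range(self.burn_in):
            for i in range(self.num_sites):
                state[i] = self.conditional_sampler(i, state, history)
        return tuple(state)


class UCTPlanner:
    """Plain UCT over action sequences; each simulation starts from a state
    drawn from the Gibbs sampler, so root values approximate an expectation
    over the belief rather than over an explicit (exponential) state space."""

    def __init__(self, actions, simulate_step, depth=10, c_uct=1.4):
        self.actions = actions
        self.simulate_step = simulate_step  # (state, action) -> (reward, next_state)
        self.depth = depth
        self.c_uct = c_uct
        self.visits = defaultdict(int)
        self.values = defaultdict(float)

    def plan(self, belief_sampler, history, num_simulations=1000):
        for _ in range(num_simulations):
            state = belief_sampler.sample_state(history)
            self._simulate(state, node=(), depth=self.depth)
        # Greedy action at the root after the search budget is spent.
        return max(self.actions, key=lambda a: self.values[((), a)])

    def _simulate(self, state, node, depth):
        if depth == 0:
            return 0.0
        action = self._select(node)
        reward, next_state = self.simulate_step(state, action)
        total = reward + self._simulate(next_state, node + (action,), depth - 1)
        key = (node, action)
        self.visits[key] += 1
        self.values[key] += (total - self.values[key]) / self.visits[key]
        return total

    def _select(self, node):
        n_node = sum(self.visits[(node, a)] for a in self.actions)

        def ucb(a):
            n = self.visits[(node, a)]
            if n == 0:
                return float("inf")  # try each action at least once
            return self.values[(node, a)] + self.c_uct * math.sqrt(math.log(n_node) / n)

        return max(self.actions, key=ucb)
```

In this sketch the belief is never enumerated: the planner only needs a conditional sampler for each state component and a generative simulator of the game, which is what makes the approach plausible when the joint state space is exponential in the number of resource sites.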
