A globally optimal algorithm for TTD-MDPs

In this paper, we discuss the use of Targeted Trajectory Distribution Markov Decision Processes (TTD-MDPs), a variant of MDPs in which the goal is to realize a specified distribution of trajectories through a state space, as a general agent-coordination framework. We present several advances to previous work on TTD-MDPs. We improve on the existing algorithm for solving TTD-MDPs by deriving a greedy algorithm that finds a policy provably minimizing the global KL-divergence from the target distribution. We test the new algorithm by applying TTD-MDPs to drama management, where a system must coordinate the behavior of many agents to ensure that a game follows a coherent storyline, is in keeping with the author's desires, and offers a high degree of replayability. Although we show that suboptimal greedy strategies will fail in some cases, we validate previous work suggesting that they can work well in practice. We also show that our new algorithm provides guaranteed accuracy even in those cases, with little additional computational cost. Further, we illustrate how this new approach can be applied online, eliminating the memory-intensive offline sampling required by the previous approach.
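As a rough sketch (the notation here is illustrative, not taken verbatim from the paper), the objective described above can be written as choosing a policy pi whose induced distribution over complete trajectories, P_pi, is as close as possible to an author-specified target distribution P*:

\[
\min_{\pi} \; D_{\mathrm{KL}}\!\left(P^{*} \,\|\, P_{\pi}\right)
\;=\;
\min_{\pi} \sum_{t \in \mathcal{T}} P^{*}(t) \,\log \frac{P^{*}(t)}{P_{\pi}(t)},
\]

where \(\mathcal{T}\) denotes the set of complete trajectories through the state space. The direction of the divergence and the symbols \(P^{*}\), \(P_{\pi}\), and \(\mathcal{T}\) are assumptions made for this sketch; the paper's contribution is an algorithm that minimizes this global divergence rather than only per-state (local) divergences.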
