Heuristic Search Value Iteration for Zero-Sum Stochastic Games

In sequential decision making, heuristic search algorithms exploit both the initial situation and an admissible heuristic to search efficiently for an optimal solution, often for planning purposes. Such algorithms exist for problems with uncertain dynamics, partial observability, multiple criteria, or multiple collaborating agents. In this article, we consider two-player zero-sum stochastic games (zsSGs) with a discounted criterion, with a view to proposing a solution tailored to the fully observable case; existing solutions address particular partially observable cases, which are nonetheless more general. This setting requires reasoning on both a lower and an upper bound of the value function, which leads us to propose zsSG-HSVI, an algorithm based on heuristic search value iteration (HSVI) that thus relies on generating trajectories. We demonstrate that, with each player acting optimistically and with simple heuristic initializations, HSVI's convergence in finite time to an $\epsilon$-optimal solution is preserved. An empirical study of the resulting approach is conducted on benchmark problems of various sizes.
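
To make the role of the two bounds concrete, the following sketch writes out the standard fixed-point equation for a discounted zsSG and the bound gap that an HSVI-style search would track. The notation is an assumption introduced here for illustration and does not come from the text above: $S$ is the state set, $A_1$ and $A_2$ are the action sets of the maximizing and minimizing players, $r$ is the reward, $P$ the transition function, $\gamma \in [0,1)$ the discount factor, and $s_0$ the initial state.

$$
V^*(s) \;=\; \max_{\mu \in \Delta(A_1)} \; \min_{a_2 \in A_2} \; \sum_{a_1 \in A_1} \mu(a_1) \Big[ r(s,a_1,a_2) + \gamma \sum_{s' \in S} P(s' \mid s,a_1,a_2)\, V^*(s') \Big]
$$

$$
\underline{V}(s) \;\le\; V^*(s) \;\le\; \overline{V}(s) \quad \text{for all } s \in S,
\qquad \text{stop when } \overline{V}(s_0) - \underline{V}(s_0) \le \epsilon .
$$

Under this reading, acting optimistically would mean that the maximizing player selects its actions with respect to $\overline{V}$ and the minimizing player with respect to $\underline{V}$ while a trajectory is generated; this interpretation, like the stopping test above, is an assumption and not a statement of the exact zsSG-HSVI procedure.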