The Impatient May Use Limited Optimism to Minimize Regret

Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent has an incentive to collect rewards early because later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating and then assume that it is adversarial; the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks that the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.
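As a rough illustration of the two notions the abstract relies on, the sketch below computes the discounted sum of a finite reward prefix and the regret of one play against the alternatives available under a single fixed environment behavior. It is a simplification made for illustration only: the paper works with infinite plays and with regret taken over all environment strategies, and the names `discounted_sum`, `regret`, `early`, and `late` are hypothetical, not taken from the paper.

```python
from fractions import Fraction


def discounted_sum(rewards, discount):
    """Sum of discount**i * rewards[i] over a finite reward prefix."""
    return sum(r * discount**i for i, r in enumerate(rewards))


def regret(chosen, alternatives, discount):
    """Gap between the best alternative play and the chosen play,
    for one fixed behavior of the environment (an illustrative assumption)."""
    best = max(discounted_sum(alt, discount) for alt in alternatives)
    return best - discounted_sum(chosen, discount)


# Hypothetical example: a discount factor of 1/2 and two possible plays.
lam = Fraction(1, 2)
early = [4, 0, 0]   # reward 4 collected immediately
late = [0, 0, 8]    # reward 8 collected two steps later, worth 8 * (1/2)**2 = 2

print(discounted_sum(early, lam))        # 4
print(discounted_sum(late, lam))         # 2
print(regret(late, [early, late], lam))  # 2
```

Exact rationals are used so that the comparison between plays is not affected by floating-point rounding. In this toy instance the impatient choice `early` has zero regret, while waiting for the larger reward costs 2.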
