An Axiomatic Approach to Rationality for Reinforcement Learning Agents

The status quo for objective function design in reinforcement learning (RL) is to use the value function of a Markov decision process (MDP). But this prescribes an additive utility function for RL agents, which is not obviously suitable for general-purpose use. This paper presents a minimal axiomatic framework for rationality in sequential decision making and shows that the implied cardinal utility function is of a more general form than the discounted additive utility function of an MDP. In particular, our framework allows for a state-action dependent “discount” factor that is not constrained to be less than 1 (so long as there is eventual long-run discounting). We show that although the MDP is not sufficiently expressive to model all rational preference structures (as defined by our framework), there exists a unique “optimizing MDP” whose optimal value function matches the utility of the optimal policy. The relation between the value and utility of suboptimal policies is quantified, and the implications for objective function design in RL are discussed.
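
As a rough illustration of the contrast described above (the notation is a sketch based on the abstract, not taken verbatim from the paper): a standard MDP scores a policy $\pi$ by the discounted additive value

$$V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s\right], \qquad \gamma \in [0, 1) \text{ constant},$$

whereas a utility allowed by the axiomatic framework satisfies a recursion of the form

$$U^{\pi}(s) \;=\; \mathbb{E}\!\left[\, r(s, a) \,+\, \beta(s, a)\, U^{\pi}(s') \,\right], \qquad a \sim \pi(\cdot \mid s), \;\; s' \sim P(\cdot \mid s, a),$$

where the state-action dependent factor $\beta(s, a)$ may exceed 1 on some pairs, provided the products of successive factors along trajectories eventually contract (the “eventual long-run discounting” condition).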
