Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuing settings, with a fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuing settings, we show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from the traditional RL framework, and is relevant to task specification in RL, inverse RL and preference-based RL.
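
To illustrate how a state-action dependent "discount" changes the usual Bellman backup, the following is a minimal Python sketch, not taken from the paper: the transition probabilities P, rewards R and the particular gamma_sa values are invented for illustration, and the only property they are meant to exhibit is eventual long run discounting (some per-transition factors equal 1, but no policy can avoid discounting forever). Setting every entry of gamma_sa to a single constant $\gamma < 1$ recovers standard value iteration.

import numpy as np

# Minimal sketch (illustrative, not the paper's algorithm): tabular value
# iteration in which the fixed discount gamma is replaced by a state-action
# dependent factor gamma_sa[s, a]. Individual factors may equal 1, but the
# gamma-weighted dynamics still discount in the long run, so the backup
# below remains convergent for this example.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected reward R[s, a]
gamma_sa = np.array([[0.9, 1.0],     # per-(s, a) "discount"; note the 1.0 entries
                     [0.8, 0.95],
                     [1.0, 0.7]])

V = np.zeros(n_states)
for _ in range(10_000):
    # Generalized Bellman optimality backup with per-(s, a) discounting.
    Q = R + gamma_sa * (P @ V)       # (P @ V)[s, a] = E[V(s') | s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("greedy policy:", Q.argmax(axis=1))
print("values:", V)
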

[1] Andrea Lockerd Thomaz, et al. Policy Shaping: Integrating Human Feedback with Reinforcement Learning, 2013, NIPS.

[2] J. von Neumann, O. Morgenstern. Theory of Games and Economic Behavior, 1944, Princeton University Press.

[3] S. C. Jaquette. A Utility Criterion for Markov Decision Processes, 1976.

[4] A. Tversky, et al. Rational choice and the framing of decisions, 1990.

[5] Martha White, et al. Unifying Task Specification in Reinforcement Learning, 2016, ICML.

[6] Stuart Armstrong, et al. Occam's razor is insufficient to infer the preferences of irrational agents, 2017, NeurIPS.

[7] T. Koopmans. Stationary Ordinal Utility and Impatience, 1960.

[8] K. Vind, et al. Preferences over time, 2003.

[9] P. Diamond. The Evaluation of Infinite Utility Streams, 1965.

[10] Evan L. Porteus, et al. Temporal Resolution of Uncertainty and Dynamic Choice Theory, 1978.

[11] David M. Kreps. Decision Problems with Expected Utility Criteria, I: Upper and Lower Convergent Utility, 1977, Math. Oper. Res.

[12] Peter Stone, et al. Deep Recurrent Q-Learning for Partially Observable MDPs, 2015, AAAI Fall Symposia.

[13] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.

[14] M. Botvinick, et al. The successor representation in human reinforcement learning, 2016, Nature Human Behaviour.

[15] T. Koopmans, et al. Two papers on the representation of preference orderings: representation of preference orderings with independent components of consumption, and, Representation of preference orderings over time, 1972.

[16] David M. Kreps. Notes on the Theory of Choice, 1988.

[17] G. Loewenstein, et al. Time Discounting and Time Preference: A Critical Review, 2002.

[18] Joshua B. Tenenbaum, et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016, NIPS.

[19] Silviu Pitis, et al. Source Traces for Temporal Difference Learning, 2018, AAAI.

[20] Shane Legg, et al. Deep Reinforcement Learning from Human Preferences, 2017, NIPS.

[21] Evan L. Porteus. On the Optimality of Structured Policies in Countable Stage Decision Processes, 1975.

[22] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[23] Tom Schaul, et al. The Predictron: End-To-End Learning and Planning, 2016, ICML.

[24] R. Bellman. A Markovian Decision Process, 1957.

[25] Larry G. Epstein. Stationary cardinal utility and optimal growth under uncertainty, 1983.

[26] M. Machina. Dynamic Consistency and Non-expected Utility Models of Choice under Uncertainty, 1989.

[27] John C. Harsanyi. Cardinal Utility in Welfare Economics and in the Theory of Risk-taking, 1953, Journal of Political Economy.

[28] J. Rawls. A Theory of Justice, 1999.

[29] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[30] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.

[31] Alan Fern, et al. A Bayesian Approach for Policy Learning from Trajectory Preference Queries, 2012, NIPS.

[32] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, ArXiv.

[33] Michèle Sebag, et al. Preference-Based Policy Learning, 2011, ECML/PKDD.

[34] Peter Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993, Neural Computation.

[35] Matthew J. Sobel. Discounting axioms imply risk neutrality, 2012, Annals of Operations Research.

[36] Johannes Fürnkranz, et al. A Survey of Preference-Based Reinforcement Learning Methods, 2017, J. Mach. Learn. Res.

[37] Stuart J. Russell. Rationality and Intelligence: A Brief Update, 2013, PT-AI.

[38] B. Nordstrom. Finite Markov Chains, 2005.

[39] Peter A. Streufert. Ordinal Dynamic Programming, 1991.