An Axiomatic Approach to Rationality for Reinforcement Learning Agents

The status quo for objective function design in reinforcement learning (RL) is to use the value function of a Markov decision process (MDP). But this prescribes an additive utility function for RL agents, which is not obviously suitable for general-purpose use. This paper presents a minimal axiomatic framework for rationality in sequential decision making and shows that the implied cardinal utility function is of a more general form than the discounted additive utility function of an MDP. In particular, our framework allows for a state-action dependent “discount” factor that is not constrained to be less than 1 (so long as there is eventual long-run discounting). We show that although the MDP is not sufficiently expressive to model all rational preference structures (as defined by our framework), there exists a unique “optimizing MDP” whose optimal value function matches the utility of the optimal policy. The relation between the value and utility of suboptimal policies is quantified, and the implications for objective function design in RL are discussed.
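
As a rough illustration of the contrast described above (the notation is a sketch based on the abstract, not taken verbatim from the paper): a standard MDP scores a policy $\pi$ by the discounted additive value

$$V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s\right], \qquad \gamma \in [0, 1) \text{ constant},$$

whereas a utility allowed by the axiomatic framework satisfies a recursion of the form

$$U^{\pi}(s) \;=\; \mathbb{E}\!\left[\, r(s, a) \,+\, \beta(s, a)\, U^{\pi}(s') \,\right], \qquad a \sim \pi(\cdot \mid s), \;\; s' \sim P(\cdot \mid s, a),$$

where the state-action dependent factor $\beta(s, a)$ may exceed 1 on some pairs, provided the products of successive factors along trajectories eventually contract (the “eventual long-run discounting” condition).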
