On the Expressivity of Markov Reward

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
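
The paper's polynomial-time constructions are not reproduced here, but the flavor of the trajectory-ordering case can be illustrated with a small sketch: a trajectory's discounted return is linear in the state-action reward vector, so a strict preference between two trajectories becomes a linear constraint, and asking whether some Markov reward function realizes a given ordering reduces to a linear-program feasibility check. The sketch below is illustrative only, assuming a small finite MDP; the function name `realizing_reward` and its arguments (`trajectories`, `ordering`, `margin`) are assumptions made for this example, not the paper's API.

```python
# Illustrative sketch (not the paper's exact algorithm): decide whether some
# Markov state-action reward function realizes a strict ordering over finite
# trajectories, by solving a linear-program feasibility problem.
import numpy as np
from scipy.optimize import linprog


def realizing_reward(n_states, n_actions, trajectories, ordering,
                     gamma=0.9, margin=1.0):
    """trajectories: list of [(state, action), ...] sequences.
    ordering: list of (i, j) pairs meaning trajectory i must earn strictly
    greater discounted return than trajectory j.
    Returns a reward table of shape (n_states, n_actions), or None when no
    Markov reward function can realize the ordering."""
    dim = n_states * n_actions

    def return_coeffs(traj):
        # Discounted return is a linear function of the reward vector.
        c = np.zeros(dim)
        for t, (s, a) in enumerate(traj):
            c[s * n_actions + a] += gamma ** t
        return c

    # One row per strict preference: return(traj_j) - return(traj_i) <= -margin.
    A_ub = np.array([return_coeffs(trajectories[j]) - return_coeffs(trajectories[i])
                     for (i, j) in ordering])
    b_ub = np.full(len(ordering), -margin)

    # Pure feasibility check: zero objective, unbounded reward entries.
    # Any strictly feasible reward can be rescaled, so margin = 1 loses no generality.
    res = linprog(c=np.zeros(dim), A_ub=A_ub, b_ub=b_ub,
                  bounds=(None, None), method="highs")
    return res.x.reshape(n_states, n_actions) if res.success else None


# Example: 2 states, 2 actions; prefer the trajectory that reaches state 1.
reward = realizing_reward(2, 2,
                          trajectories=[[(0, 0), (1, 0)], [(0, 1), (0, 0)]],
                          ordering=[(0, 1)])
```

When the LP is infeasible, the sketch returns None, mirroring the abstract's claim that the algorithms "correctly determine when no such reward function exists" for a given task instance.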
