Dynamic preferences in multi-criteria reinforcement learning

The standard framework of reinforcement learning is based on maximizing the expected return defined over scalar rewards. In many real-world situations, however, tradeoffs must be made among multiple objectives, and the agent's preferences among those objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that stores a finite number of policies, selects an appropriate policy for any given weight vector, and improves upon it. The idea is that although there are infinitely many weight vectors, they may be well covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan's ass problem and network routing.
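To make the weighted-scalarization idea concrete, the following is a minimal sketch, not the paper's actual algorithm: each stored policy keeps a vector-valued value estimate (one component per objective), and for a new preference weight vector w the best stored starting policy is the one maximizing the dot product of w with that value vector at the start state. The function name `select_policy`, the tabular value representation, and the example numbers are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: stored_values[i][s] is the vector-valued value
# estimate (one entry per objective) of stored policy i at state s.
# For preference weights w, the scalarized value of policy i at the
# start state s0 is the dot product w . V_i[s0].

def select_policy(stored_values, w, s0):
    """Return the index of the stored policy with the largest
    scalarized value w . V[s0]; that policy would then serve as the
    starting point for further improvement under the new weights."""
    scores = [float(np.dot(w, V[s0])) for V in stored_values]
    return int(np.argmax(scores))

# Example: two objectives (say, throughput vs. energy) and three
# stored policies, evaluated at a single start state s0 = 0.
stored_values = [
    {0: np.array([0.9, 0.1])},  # policy tuned mostly for objective 1
    {0: np.array([0.5, 0.5])},  # balanced policy
    {0: np.array([0.1, 0.9])},  # policy tuned mostly for objective 2
]
w = np.array([0.3, 0.7])        # current (time-varying) preferences
print(select_policy(stored_values, w, s0=0))  # -> 2
```

Under this sketch, a small stored set covers the weight space well whenever many weight vectors share the same maximizing policy, which is the empirical claim the abstract makes.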
