Interactive Value Iteration for Markov Decision Processes with Unknown Rewards

To tackle the potentially hard task of defining the reward function in a Markov Decision Process, we propose a new approach, based on Value Iteration, which interweaves the elicitation and optimization phases. We assume that the numeric values of rewards are unknown and that only an ordering over them is given, and that a tutor can be queried to compare sequences of rewards. We first show how the set of possible reward functions compatible with a given preference relation can be represented as a polytope. Our algorithm, called Interactive Value Iteration, then searches for an optimal policy while refining its knowledge of the possible reward functions by querying the tutor when necessary. We prove that the number of queries needed before finding an optimal policy is upper-bounded by a polynomial in the size of the problem, and we present experimental results that demonstrate that our approach is efficient in practice.
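To make the idea concrete, the following is a minimal, illustrative Python sketch of interactive value iteration, not the authors' implementation: the toy two-state MDP, the number of ordinal reward levels, and the simulated tutor (driven by a hidden ground-truth reward vector) are all assumptions. Values are kept as vectors of discounted weights over the reward levels, the feasible reward functions form a polytope of linear constraints, and a linear program (scipy.optimize.linprog) decides whether one value vector dominates another over the whole polytope; when neither dominates, the tutor is queried and its answer is added as a new constraint.

```python
# Illustrative sketch only: toy MDP, reward levels, and the simulated tutor are assumptions.
import numpy as np
from scipy.optimize import linprog

K = 3            # number of ordinal reward levels r_0 <= r_1 <= r_2 (assumed)
gamma = 0.9      # discount factor

# Toy MDP: 2 states, 2 actions. P[s, a, s'] is the transition law,
# level[s, a] is the ordinal reward level received for taking a in s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
level = np.array([[0, 1],
                  [2, 0]])

# Polytope of feasible reward vectors r in [0, 1]^K, written as A r <= b.
# Initially it only encodes the known ordering r_0 <= r_1 <= ... <= r_{K-1}.
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.zeros(2)

true_r = np.array([0.0, 0.3, 1.0])   # hidden reward vector used to simulate the tutor
queries = 0

def dominates(u, v):
    """True if u . r >= v . r for every r in the current polytope (checked by an LP)."""
    res = linprog(u - v, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * K)
    return res.status == 0 and res.fun >= -1e-9

def prefer(u, v):
    """Compare value vectors u and v, querying the (simulated) tutor when neither dominates."""
    global A, b, queries
    if dominates(u, v):
        return True
    if dominates(v, u):
        return False
    queries += 1                         # ambiguous comparison: ask the tutor
    answer = true_r @ u >= true_r @ v
    better, worse = (u, v) if answer else (v, u)
    A = np.vstack([A, worse - better])   # add constraint (worse - better) . r <= 0
    b = np.append(b, 0.0)
    return answer

# Value iteration on vector-valued returns V[s] in R^K.
V = np.zeros((2, K))
for _ in range(50):
    newV = np.empty_like(V)
    for s in range(2):
        best = None
        for a in range(2):
            # one-hot vector of the reward level plus discounted expected successor value
            q = np.eye(K)[level[s, a]] + gamma * P[s, a] @ V
            if best is None or prefer(q, best):
                best = q
        newV[s] = best
    V = newV

print("queries asked:", queries)
print("state values under the true rewards:", V @ true_r)
```

In this sketch, every tutor answer shrinks the polytope, so later comparisons are increasingly resolved by the LP alone; this mirrors, in simplified form, why the number of queries stays bounded rather than growing with every value-iteration sweep.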
