Interactive Value Iteration for Markov Decision Processes with Unknown Rewards

To tackle the potentially hard task of defining the reward function in a Markov Decision Process, we propose a new approach, based on Value Iteration, which interweaves the elicitation and optimization phases. We assume that the numeric values of rewards are unknown and that only an ordering over them is given, and that a tutor can be queried to compare sequences of rewards. We first show how the set of possible reward functions compatible with a given preference relation can be represented as a polytope. Our algorithm, called Interactive Value Iteration, then searches for an optimal policy while refining its knowledge of the possible reward functions by querying the tutor when necessary. We prove that the number of queries needed before finding an optimal policy is upper-bounded by a polynomial in the size of the problem, and we present experimental results that demonstrate that our approach is efficient in practice.
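To make the idea concrete, the following is a minimal, illustrative Python sketch of interactive value iteration, not the authors' implementation: the toy two-state MDP, the number of ordinal reward levels, and the simulated tutor (driven by a hidden ground-truth reward vector) are all assumptions. Values are kept as vectors of discounted weights over the reward levels, the feasible reward functions form a polytope of linear constraints, and a linear program (scipy.optimize.linprog) decides whether one value vector dominates another over the whole polytope; when neither dominates, the tutor is queried and its answer is added as a new constraint.

```python
# Illustrative sketch only: toy MDP, reward levels, and the simulated tutor are assumptions.
import numpy as np
from scipy.optimize import linprog

K = 3            # number of ordinal reward levels r_0 <= r_1 <= r_2 (assumed)
gamma = 0.9      # discount factor

# Toy MDP: 2 states, 2 actions. P[s, a, s'] is the transition law,
# level[s, a] is the ordinal reward level received for taking a in s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
level = np.array([[0, 1],
                  [2, 0]])

# Polytope of feasible reward vectors r in [0, 1]^K, written as A r <= b.
# Initially it only encodes the known ordering r_0 <= r_1 <= ... <= r_{K-1}.
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.zeros(2)

true_r = np.array([0.0, 0.3, 1.0])   # hidden reward vector used to simulate the tutor
queries = 0

def dominates(u, v):
    """True if u . r >= v . r for every r in the current polytope (checked by an LP)."""
    res = linprog(u - v, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * K)
    return res.status == 0 and res.fun >= -1e-9

def prefer(u, v):
    """Compare value vectors u and v, querying the (simulated) tutor when neither dominates."""
    global A, b, queries
    if dominates(u, v):
        return True
    if dominates(v, u):
        return False
    queries += 1                         # ambiguous comparison: ask the tutor
    answer = true_r @ u >= true_r @ v
    better, worse = (u, v) if answer else (v, u)
    A = np.vstack([A, worse - better])   # add constraint (worse - better) . r <= 0
    b = np.append(b, 0.0)
    return answer

# Value iteration on vector-valued returns V[s] in R^K.
V = np.zeros((2, K))
for _ in range(50):
    newV = np.empty_like(V)
    for s in range(2):
        best = None
        for a in range(2):
            # one-hot vector of the reward level plus discounted expected successor value
            q = np.eye(K)[level[s, a]] + gamma * P[s, a] @ V
            if best is None or prefer(q, best):
                best = q
        newV[s] = best
    V = newV

print("queries asked:", queries)
print("state values under the true rewards:", V @ true_r)
```

In this sketch, every tutor answer shrinks the polytope, so later comparisons are increasingly resolved by the LP alone; this mirrors, in simplified form, why the number of queries stays bounded rather than growing with every value-iteration sweep.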
