Preference-based reinforcement learning: a formal framework and a policy iteration algorithm

This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning in which qualitative reward signals can be used directly by the learner. This framework may be viewed as a generalization of the conventional RL framework, in which only a partial order between policies is required instead of the total order induced by their respective expected long-term rewards. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow its available actions to be sorted from most to least promising, as well as with algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences on the basis of utilities observed for trajectories. To this end, we build on an existing method for approximate policy iteration based on roll-outs. While that approach relies on classification methods for generalization and policy learning, we instead employ a specific type of preference learning method called label ranking. The advantages of preference-based approximate policy iteration are illustrated by means of two case studies.
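To make the idea concrete, the following is a minimal sketch of preference-based approximate policy iteration as described above: Monte-Carlo roll-outs estimate trajectory utilities, pairwise comparisons of these utilities yield qualitative action preferences, and a pairwise label ranker generalizes the preferences over states so that the improved policy picks the top-ranked action. This is an illustrative reconstruction, not the authors' implementation; the environment interface (set_state/step), the featurize function, the use of logistic regression as the pairwise base learner, and all hyperparameters are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression


def rollout_return(env, state, first_action, policy, horizon=50, gamma=0.95):
    """Utility of taking `first_action` in `state` and then following `policy`."""
    s = env.set_state(state)            # assumed: resets the env to `state`
    a, total, discount = first_action, 0.0, 1.0
    for _ in range(horizon):
        s, r, done = env.step(a)        # assumed gym-like step interface
        total += discount * r
        discount *= gamma
        if done:
            break
        a = policy(s)
    return total


def collect_preferences(env, states, actions, policy, featurize, n_rollouts=10):
    """Compare roll-out utilities of action pairs to obtain qualitative preferences."""
    X, pairs = [], []                   # state features and (better, worse) action pairs
    for s in states:
        utils = {a: np.mean([rollout_return(env, s, a, policy)
                             for _ in range(n_rollouts)]) for a in actions}
        for a, b in combinations(actions, 2):
            if utils[a] != utils[b]:    # keep only strict preferences
                X.append(featurize(s))
                pairs.append((a, b) if utils[a] > utils[b] else (b, a))
    return np.array(X), pairs


def train_ranking_policy(X, pairs, actions, featurize):
    """Pairwise label-ranking decomposition: one binary model per action pair."""
    models = {}
    for a, b in combinations(actions, 2):
        mask = np.array([p in ((a, b), (b, a)) for p in pairs])
        y = np.array([1 if p == (a, b) else 0 for p in pairs])[mask]
        if mask.any() and len(set(y)) == 2:
            models[(a, b)] = LogisticRegression().fit(X[mask], y)

    def policy(state):
        x = featurize(state)
        votes = dict.fromkeys(actions, 0.0)
        for (a, b), model in models.items():
            p = model.predict_proba([x])[0, 1]   # estimated probability that a is preferred to b
            votes[a] += p
            votes[b] += 1.0 - p
        return max(votes, key=votes.get)         # top-ranked action wins
    return policy
```

In use, one would iterate in the usual policy-iteration fashion: start from an arbitrary (e.g., random) policy, call collect_preferences on a sample of states, train a new ranking policy with train_ranking_policy, and feed it back into the next round of roll-outs.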
