Dueling RL: Reinforcement Learning with Trajectory Preferences

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in the form of a 1-bit (0/1) preference over a pair of trajectories, rather than absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model; this, however, depends on how accurately a system designer can specify an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension d. Assuming the transition model is known, we propose an algorithm with an almost optimal regret guarantee of Õ(SHd log(T/δ)√T). We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with a near optimal regret guarantee of Õ((√d + H + |S|)√(dT) + √(|S||A|TH)). To the best of our knowledge, our work is one of the first to give tight regret guarantees for the PbRL problem with trajectory preferences.
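To make the feedback model concrete, here is a minimal sketch of one standard way to encode trajectory preferences with a generalized linear model of dimension d; the feature map φ, parameter θ*, and sigmoid link μ below are illustrative assumptions for this sketch, not necessarily the paper's exact notation.

% Sketch: generalized linear preference model over a trajectory pair.
% phi(tau) in R^d is an assumed trajectory feature map, theta^* the unknown parameter,
% and mu a fixed increasing link function (the sigmoid is one common choice).
\[
  \mathbb{P}\big(\tau_1 \succ \tau_2\big)
    \;=\; \mu\big(\langle \phi(\tau_1) - \phi(\tau_2),\; \theta^{\ast} \rangle\big),
  \qquad
  \mu(z) \;=\; \frac{1}{1 + e^{-z}},
\]
% The learner never sees rewards, only a 1-bit comparison outcome per episode:
\[
  o \;\sim\; \mathrm{Bernoulli}\big(\mathbb{P}(\tau_1 \succ \tau_2)\big),
  \qquad o \in \{0, 1\}.
\]

The sigmoid link is only one standard instantiation; the structural assumption of a generalized linear preference model is a fixed, strictly increasing link applied to a linear score over d-dimensional trajectory features.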
