Improving Language Models with Advantage-based Offline Policy Gradients

Improving language model generations according to user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering "low-quality" data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signal, while data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is: "Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?" To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments on five different language generation tasks, using models of varying sizes and multiple rewards, show that models trained with LoL-RL consistently outperform the best supervised learning models. We also release our experimental code at https://github.com/abaheti95/LoL-RL.
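To make the core idea concrete, below is a minimal sketch of an advantage-weighted offline policy-gradient loss, assuming each training sequence comes with a scalar reward (e.g., a classifier score) and a baseline value estimate, and treating each output as a single-step RL episode. The function and variable names (`advantage_weighted_loss`, `rewards`, `values`) are illustrative assumptions, not taken from the released LoL-RL codebase.

```python
import torch

def advantage_weighted_loss(policy_logits, target_ids, rewards, values, pad_id=0):
    """Offline policy-gradient loss on a batch of (input, output) pairs,
    treating each output sequence as a 1-step episode.

    policy_logits: (batch, seq_len, vocab) logits from the LM being finetuned
    target_ids:    (batch, seq_len) output token ids from the offline data
    rewards:       (batch,) scalar utility of each output (e.g., classifier score)
    values:        (batch,) baseline value estimate for each input
    """
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = (target_ids != pad_id).float()
    seq_logp = (token_logp * mask).sum(-1)          # sequence log-likelihood

    advantage = (rewards - values).clamp(min=0.0)   # keep only positive-advantage data
    # Policy gradient: maximize the advantage-weighted log-likelihood of the data
    return -(advantage.detach() * seq_logp).mean()
```

In this sketch, sequences whose reward falls below the baseline contribute zero gradient, so training concentrates on the "leftover" positive-advantage portion of the existing data rather than requiring fresh exploration.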
