Improving Language Models with Advantage-based Offline Policy Gradients

Improving language model generations according to user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering "low-quality" data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signal, while data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is: "Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?" To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments on five different language generation tasks, using models of varying sizes and multiple rewards, show that models trained with LoL-RL consistently outperform the best supervised learning models. We also release our experimental code at https://github.com/abaheti95/LoL-RL.
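To make the core idea concrete, below is a minimal sketch of an advantage-weighted offline policy-gradient loss, assuming each training sequence comes with a scalar reward (e.g., a classifier score) and a baseline value estimate, and treating each output as a single-step RL episode. The function and variable names (`advantage_weighted_loss`, `rewards`, `values`) are illustrative assumptions, not taken from the released LoL-RL codebase.

```python
import torch

def advantage_weighted_loss(policy_logits, target_ids, rewards, values, pad_id=0):
    """Offline policy-gradient loss on a batch of (input, output) pairs,
    treating each output sequence as a 1-step episode.

    policy_logits: (batch, seq_len, vocab) logits from the LM being finetuned
    target_ids:    (batch, seq_len) output token ids from the offline data
    rewards:       (batch,) scalar utility of each output (e.g., classifier score)
    values:        (batch,) baseline value estimate for each input
    """
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = (target_ids != pad_id).float()
    seq_logp = (token_logp * mask).sum(-1)          # sequence log-likelihood

    advantage = (rewards - values).clamp(min=0.0)   # keep only positive-advantage data
    # Policy gradient: maximize the advantage-weighted log-likelihood of the data
    return -(advantage.detach() * seq_logp).mean()
```

In this sketch, sequences whose reward falls below the baseline contribute zero gradient, so training concentrates on the "leftover" positive-advantage portion of the existing data rather than requiring fresh exploration.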
