Learning to summarize from human feedback

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these are only rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
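
For concreteness, below is a minimal sketch of the two training signals the abstract describes: a reward model fit to pairwise human comparisons, and a reward for RL fine-tuning that discourages the policy from drifting too far from the supervised baseline. It assumes a PyTorch-style setup; the function names, shapes, and KL coefficient are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise comparison loss: push the reward model to score the
    # human-preferred summary above the alternative, i.e. minimize
    # -log sigmoid(r(x, y_preferred) - r(x, y_rejected)) over comparisons.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def rl_reward(rm_score: torch.Tensor,
              logprob_policy: torch.Tensor,
              logprob_sft: torch.Tensor,
              kl_coef: float = 0.05) -> torch.Tensor:
    # Reward handed to the RL optimizer: the learned reward-model score
    # minus a KL-style penalty that keeps the fine-tuned policy close to
    # the supervised baseline. The coefficient 0.05 is a placeholder.
    return rm_score - kl_coef * (logprob_policy - logprob_sft)
```

In this sketch, the reward-model scores would come from a language model with a scalar head, and the log-probabilities from the current RL policy and the frozen supervised model on the same sampled summary.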
