Reinforced Self-Training (ReST) for Language Modeling

Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences. We propose Reinforced Self-Training (ReST), a simple algorithm for aligning LLMs with human preferences, inspired by growing-batch reinforcement learning (RL). Given an initial LLM policy, ReST produces a dataset by generating samples from that policy, which are then used to improve the policy with offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality in a compute- and sample-efficient manner, as measured by automated metrics and human evaluation on machine translation benchmarks.
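To make the sampling-then-offline-training loop concrete, below is a minimal Python sketch of one way to organize ReST's outer data-generation ("Grow") and inner offline-improvement ("Improve") iterations. The `sample`, `reward`, and `finetune` callables, the reward-threshold filtering schedule, and all hyperparameter values are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a ReST-style loop: Grow (sample a dataset from the policy),
# then Improve (offline fine-tuning on reward-filtered samples, reusing the data).
# The policy, reward model, and fine-tuning routine are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReSTConfig:
    grow_steps: int = 2                              # outer iterations that regenerate the dataset
    improve_steps: int = 3                           # inner iterations over the fixed dataset
    samples_per_prompt: int = 8                      # candidates drawn from the policy per input
    thresholds: Tuple[float, ...] = (0.5, 0.6, 0.7)  # assumed rising reward filter per Improve step


def rest(
    prompts: List[str],
    sample: Callable[[str, int], List[str]],             # hypothetical: draw n samples from the current policy
    reward: Callable[[str, str], float],                  # hypothetical: learned reward / quality score
    finetune: Callable[[List[Tuple[str, str]]], None],    # hypothetical: offline fine-tuning on (prompt, output) pairs
    cfg: ReSTConfig = ReSTConfig(),
) -> None:
    for _ in range(cfg.grow_steps):
        # Grow step: build an offline dataset by sampling from the current policy
        # and scoring every candidate once.
        dataset = [
            (x, y, reward(x, y))
            for x in prompts
            for y in sample(x, cfg.samples_per_prompt)
        ]
        # Improve steps: repeatedly filter by an increasing reward threshold and
        # fine-tune the policy on the surviving samples; the dataset is reused
        # across these inner iterations rather than regenerated online.
        for tau in cfg.thresholds[: cfg.improve_steps]:
            kept = [(x, y) for x, y, r in dataset if r >= tau]
            finetune(kept)
```

In this reading, the offline-RL flavor comes from training only on previously generated, reward-filtered data; swapping the filtered fine-tuning step for another offline RL objective would fit the same skeleton.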
