Scaling Laws and Interpretability of Learning from Repeated Data

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally, to upweight higher-quality data, or unintentionally, because data deduplication is imperfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can cause test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, the performance of an 800M-parameter model can be degraded to that of a 2x smaller model (400M parameters) by repeating 0.1% of the data 100 times, even though the other 90% of the training tokens remain unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work, which attempts to reverse engineer the detailed computations performed by the model, by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
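To make the abstract's arithmetic concrete, the following is a minimal Python sketch (not the paper's actual data pipeline) of how such a mixture could be constructed, under the assumption that "0.1% of the data" refers to a fixed training budget: a subset equal to 0.1% of the budget, repeated 100 times, fills 10% of the training examples, leaving the other 90% unique. The function and parameter names (build_repeated_mixture, subset_frac, n_repeats) are illustrative, not identifiers from the paper.

import random

def build_repeated_mixture(doc_stream, total_docs=1_000_000,
                           subset_frac=0.001, n_repeats=100, seed=0):
    """Fill a fixed training budget of `total_docs` examples.

    A small subset (subset_frac * total_docs documents) is drawn once and
    shown n_repeats times; the rest of the budget is filled with unique
    documents. With subset_frac=0.001 and n_repeats=100, the repeated
    subset occupies 0.001 * 100 = 10% of the budget, so the remaining
    90% of training examples are unique.
    """
    rng = random.Random(seed)
    docs = iter(doc_stream)

    n_subset = int(total_docs * subset_frac)        # e.g. 1,000 documents
    repeated_budget = n_subset * n_repeats          # e.g. 100,000 examples (10%)
    unique_budget = total_docs - repeated_budget    # e.g. 900,000 examples (90%)

    repeated_subset = [next(docs) for _ in range(n_subset)]
    unique_part = [next(docs) for _ in range(unique_budget)]

    mixture = unique_part + repeated_subset * n_repeats
    rng.shuffle(mixture)
    return mixture

# Arithmetic check matching the abstract's example (hypothetical corpus of doc IDs):
corpus = (f"doc_{i}" for i in range(2_000_000))
mix = build_repeated_mixture(corpus)
assert len(mix) == 1_000_000        # fixed training budget
assert len(set(mix)) == 901_000     # 900,000 unique documents + 1,000 repeated ones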
