Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? An Extensive Empirical Study on Language Tasks

There has been a lot of interest in the scaling properties of Transformer models (Kaplan et al., 2020). However, not much has been done to investigate the effect of scaling properties of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
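As a rough illustration of what "scaling behaviour" means here, the sketch below (not from the paper; the data points, architecture labels, and the helper fit_power_law are hypothetical) fits the power-law form L(N) ≈ a · N^(-alpha) popularized by Kaplan et al. (2020) to toy loss-versus-size points for two architectures. Architecture-dependent exponents alpha are precisely what allow the best performing model to fluctuate across scales.

# Minimal sketch (assumed setup, not the paper's code): fit the
# Kaplan-style power law L(N) ≈ a * N^(-alpha) to hypothetical
# (parameter count, upstream loss) pairs for two architectures.
import numpy as np

# Hypothetical data: model size in parameters vs. pretraining loss.
sizes = np.array([1e7, 1e8, 1e9, 1e10])
loss_arch_a = np.array([4.2, 3.6, 3.1, 2.7])  # e.g. a vanilla Transformer
loss_arch_b = np.array([4.0, 3.7, 3.4, 3.2])  # e.g. an alternative architecture

def fit_power_law(n, loss):
    """Fit log L = log a - alpha * log N in log-log space; return (a, alpha)."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return np.exp(intercept), -slope

for name, loss in [("arch A", loss_arch_a), ("arch B", loss_arch_b)]:
    a, alpha = fit_power_law(sizes, loss)
    print(f"{name}: L(N) ≈ {a:.2f} * N^(-{alpha:.3f})")

# A steeper alpha means faster improvement with scale, so the ranking
# of the two architectures can flip as N grows, matching the paper's
# observation that the best model depends on the scale considered.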
