Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs; we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight averaging and logit ensembling to the flatness of the loss and the confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.

Figure caption: For a CLIP and ImageNet-22k pre-trained ViT-B/32 model, we use the best hyperparameters found by the hyperparameter sweep to fine-tune multiple times, varying only the random seed. For an experimental budget of 1 ≤ k ≤ 5 models, we show: (i) the individual model with random seed k, (ii) the model soup composed of models with random seeds 1 through k, and (iii) the ensemble composed of models with random seeds 1 through k.
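
To make the core operation concrete, the sketch below averages the weights of several fine-tuned checkpoints into a single model (a "uniform soup"), so inference cost stays that of one model. This is a minimal illustration, not the paper's official implementation (see the linked repository for that): it assumes PyTorch, assumes each file stores a plain state dict, and the checkpoint paths and the build_model constructor are hypothetical placeholders.

```python
import torch


def uniform_soup(checkpoint_paths, build_model):
    """Average the weights of several fine-tuned checkpoints into one model.

    Assumes each path points to a plain state_dict saved with torch.save;
    build_model() must construct the same architecture the checkpoints came from.
    """
    num_models = len(checkpoint_paths)
    soup_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup_state is None:
            # Start the running sum from the first checkpoint; keep non-float
            # entries (e.g. integer buffers) as-is.
            soup_state = {
                k: v.clone().float() if torch.is_floating_point(v) else v.clone()
                for k, v in state.items()
            }
        else:
            for k, v in state.items():
                if torch.is_floating_point(v):
                    soup_state[k] += v.float()
    # Parameter-wise mean over the "ingredient" models.
    soup_state = {
        k: v / num_models if torch.is_floating_point(v) else v
        for k, v in soup_state.items()
    }
    model = build_model()  # same architecture as the fine-tuned models
    model.load_state_dict(soup_state)
    return model


# Hypothetical usage: average five fine-tunes that differ only in random seed.
# paths = [f"finetuned_seed{i}.pt" for i in range(1, 6)]
# soup_model = uniform_soup(paths, build_model=my_vit_constructor)
```

The paper's greedy variant instead adds checkpoints one at a time, keeping each only if held-out validation accuracy improves; the averaging step itself is unchanged.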
