Ranking and Tuning Pre-trained Models: A New Paradigm for Exploiting Model Hubs

Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at a high cost, they remain under-exploited: practitioners usually pick one PTM from the hub by popularity and then fine-tune it to solve the target task. This naïve but common practice poses two obstacles to fully exploiting pre-trained model hubs: first, PTM selection by popularity carries no optimality guarantee, and second, only one PTM is used while the remaining PTMs are ignored. An alternative would be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this is not only computationally prohibitive but also prone to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that lies between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub, across diverse types of PTMs and tasks, before fine-tuning. (2) If there is no preference for the model architecture, the best-ranked PTM can simply be fine-tuned and deployed; otherwise, a target PTM can be tuned with the top-K ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as B-Tuning, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs, where it yields a new level of benchmark performance.
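
The evidence-maximization ranking step can be illustrated with a short sketch. The snippet below is a minimal example, under stated assumptions, of scoring each PTM by the log marginal evidence of the target labels given its extracted features, using a Bayesian linear model whose prior and noise precisions are updated with MacKay-style fixed-point iterations. The function names (`log_evidence`, `rank_ptms`), the iteration count, and the single-output regression setup are illustrative choices, not the paper's implementation.

```python
import numpy as np

def log_evidence(features, labels, n_iter=20):
    """Log marginal evidence of labels given PTM features under a Bayesian
    linear model; the precisions (alpha, beta) are tuned by MacKay-style
    fixed-point updates. A simplified sketch, not the paper's code."""
    n, d = features.shape
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    s2 = s ** 2
    fy = features.T @ labels          # F^T y
    alpha, beta = 1.0, 1.0
    for _ in range(n_iter):
        # Posterior mean m = beta * (alpha I + beta F^T F)^{-1} F^T y, via the SVD basis.
        m = vt.T @ ((beta * (vt @ fy)) / (beta * s2 + alpha))
        gamma = np.sum(beta * s2 / (beta * s2 + alpha))      # effective dimensions
        resid = labels - features @ m
        alpha = gamma / max(float(m @ m), 1e-12)
        beta = (n - gamma) / max(float(resid @ resid), 1e-12)
    # Recompute the posterior mean with the final alpha, beta before scoring.
    m = vt.T @ ((beta * (vt @ fy)) / (beta * s2 + alpha))
    resid = labels - features @ m
    logdet = np.sum(np.log(beta * s2 + alpha)) + (d - len(s2)) * np.log(alpha)
    return 0.5 * (d * np.log(alpha) + n * np.log(beta)
                  - beta * float(resid @ resid) - alpha * float(m @ m)
                  - logdet - n * np.log(2 * np.pi))

def rank_ptms(feature_sets, labels):
    """Rank candidate PTMs by the evidence of the labels given their features."""
    scores = {name: log_evidence(f, labels) for name, f in feature_sets.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For a classification target, one would typically score each one-hot label column separately and average the resulting evidences; `feature_sets` is assumed to hold, for each candidate PTM, the feature matrix obtained from a forward pass over the target data.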
