Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning

Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a data-rich source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method: fine-tuning all parameters of the source model to the target domain, possibly because fine-tuning allows the model to leverage useful information from intermediate layers that is otherwise discarded by the later pretrained layers. We explore the hypothesis that these intermediate layers might be directly exploited. We propose a method, Head-to-Toe probing (Head2Toe), that selects features from all layers of the source model to train a classification head for the target domain. In evaluations on the Visual Task Adaptation Benchmark (VTAB), Head2Toe matches the average performance obtained with fine-tuning while reducing training and storage cost a hundredfold or more; critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.

Figure: We demonstrate the effectiveness of group lasso at identifying relevant intermediate features of a ResNet-50 trained on ImageNet. We rank all features by their relevance score, s_i, and select groups of 2048 consecutive features beginning at a particular offset in this ranking; offset 0 corresponds to selecting the features with the largest relevance. We then report average test accuracy across all VTAB tasks. Test accuracy decreases monotonically with the offset, indicating that the relevance score predicts how useful a feature is to the linear classifier.
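The selection procedure above can be sketched compactly: concatenate pooled activations from every intermediate layer of a frozen backbone, train a linear head with a group-lasso penalty (one group per feature's outgoing weights), score each feature by the l2-norm of its group, and keep a window of features in that ranking. The following is a minimal, illustrative PyTorch version; the helper names (build_feature_extractor, train_probe, select_features) and all hyperparameters are our own assumptions, not the authors' reference implementation, which additionally pools features at multiple scales.

```python
# Sketch of Head2Toe-style probing: frozen backbone, group-lasso linear
# head over concatenated intermediate features, relevance-based selection.
import torch
import torch.nn as nn
import torchvision.models as models


def build_feature_extractor():
    """Frozen ImageNet ResNet-50 with hooks on every residual block."""
    backbone = models.resnet50(weights="IMAGENET1K_V2").eval()
    for p in backbone.parameters():
        p.requires_grad_(False)
    feats = []

    def hook(_module, _inputs, out):
        # Spatially average-pool each intermediate map to a flat vector.
        feats.append(torch.flatten(nn.functional.adaptive_avg_pool2d(out, 1), 1))

    for layer in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
        for block in layer:
            block.register_forward_hook(hook)

    def extract(x):
        feats.clear()
        backbone(x)  # hooks fill `feats` during the forward pass
        return torch.cat(feats, dim=1)  # all intermediate features, concatenated

    return extract


def train_probe(extract, loader, num_classes, lam=1e-4, epochs=10):
    """Linear head on concatenated features with a group-lasso penalty."""
    dim = extract(next(iter(loader))[0]).shape[1]
    head = nn.Linear(dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            logits = head(extract(x))
            # Group lasso: one group per input feature, i.e. the l2-norm
            # of that feature's column of outgoing weights, summed.
            group_lasso = head.weight.norm(dim=0).sum()
            loss = nn.functional.cross_entropy(logits, y) + lam * group_lasso
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head


def select_features(head, keep=2048, offset=0):
    """Rank features by relevance s_i = ||w_i||_2 and keep `keep`
    consecutive features starting at `offset` in that ranking
    (offset 0 keeps the most relevant features)."""
    scores = head.weight.norm(dim=0)  # s_i for each input feature
    order = torch.argsort(scores, descending=True)
    return order[offset:offset + keep]  # indices of the kept features
```

After selection, a fresh linear head would be trained on only the kept features; sweeping `offset` as in the figure tests whether s_i genuinely orders features by their usefulness to the classifier.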
