Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning

Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a data-rich source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method, fine-tuning all parameters of the source model on the target domain, possibly because fine-tuning allows the model to leverage useful information from intermediate layers that would otherwise be discarded by the later pretrained layers. We explore the hypothesis that these intermediate layers can be exploited directly. We propose a method, Head-to-Toe probing (Head2Toe), that selects features from all layers of the source model to train a classification head for the target domain. In evaluations on the Visual Task Adaptation Benchmark (VTAB), Head2Toe matches the average performance of fine-tuning while reducing training and storage cost a hundredfold or more; critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.
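To make the idea concrete, below is a minimal sketch of Head2Toe-style probing in PyTorch. The backbone choice (torchvision's resnet50), the 2x2 pooling size, the penalty strength, the kept fraction, and helper names such as `head2toe_features` are illustrative assumptions, not the paper's exact configuration; the common thread is a frozen backbone, concatenated intermediate features, and a group-lasso-regularized linear head used to select features.

```python
# Sketch of Head2Toe-style probing. Hyperparameters and helper names are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Frozen source model: no backbone parameter is updated.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# Capture intermediate activations with forward hooks on each stage.
feats = []
def grab(_module, _inputs, output):
    # Pool spatial maps to 2x2 so every layer yields a fixed-size vector.
    if output.dim() == 4:
        output = F.adaptive_avg_pool2d(output, 2)
    feats.append(output.flatten(1))

stages = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
hooks = [stage.register_forward_hook(grab) for stage in stages]

def head2toe_features(x):
    """Concatenate pooled activations from all hooked layers."""
    feats.clear()
    with torch.no_grad():
        backbone(x)
    return torch.cat(feats, dim=1)

# Toy target-domain batch (stand-in for real data).
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 10, (8,))
z = head2toe_features(x)  # one long "head-to-toe" feature vector per image

# Linear probe over all features, regularized with a group lasso
# (one group per feature, i.e. that feature's column of class weights).
head = nn.Linear(z.shape[1], 10)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(head(z), y) + 1e-4 * head.weight.norm(dim=0).sum()
    loss.backward()
    opt.step()

# Score features by the l2 norm of their outgoing weights, keep the top
# fraction, and retrain a smaller head on the selected subset.
scores = head.weight.norm(dim=0)
keep = scores.topk(z.shape[1] // 100).indices  # keep ~1% of features
small_head = nn.Linear(keep.numel(), 10)       # train on z[:, keep]
```

Because only the small selected head is stored per target task, the storage and training cost scales with the number of kept features rather than with the backbone, which is the source of the cost reduction claimed above.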
