What makes ImageNet good for transfer learning?

The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of training examples per class? Does adding more object classes improve performance? For the same data budget, how should the data be split into classes? Is fine-grained recognition necessary for learning good features? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? To answer these and related questions, we pre-trained CNN features on various subsets of the ImageNet dataset and evaluated transfer performance on PASCAL detection, PASCAL action classification, and SUN scene classification tasks. Our overall findings suggest that most changes in the choice of pre-training data long thought to be critical do not significantly affect transfer performance.? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class?

[1]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[2]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[3]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[4]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[5]  Michael Jünger,et al.  Improving Walker's Algorithm to Run in Linear Time , 2002, GD.

[6]  H. Bourlard,et al.  Auto-association by multilayer perceptrons and singular value decomposition , 1988, Biological Cybernetics.

[7]  Marc'Aurelio Ranzato,et al.  Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[9]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[10]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[11]  Pascal Vincent,et al.  Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives , 2012, ArXiv.

[12]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[14]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[15]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[17]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[18]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[20]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[21]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[22]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[23]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[24]  Jitendra Malik,et al.  Analyzing the Performance of Multilayer Neural Networks for Object Recognition , 2014, ECCV.

[25]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Alan L. Yuille,et al.  Discovering Internal Representations from Object-CNNs Using Population Encoding , 2015, ArXiv.

[27]  Jonathan Tompson,et al.  Unsupervised Feature Learning from Temporal Data , 2015, ICLR.

[28]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Alan Yuille,et al.  Unsupervised learning of object semantic parts from internal states of CNNs by population encoding , 2015, 1511.06855.

[30]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[31]  Thomas Brox,et al.  Inverting Convolutional Networks with Convolutional Networks , 2015, ArXiv.

[32]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[33]  Mathieu Aubry,et al.  Understanding Deep Features with Computer-Generated Imagery , 2015, ICCV.

[34]  Kristen Grauman,et al.  Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Atsuto Maki,et al.  From generic to specific deep representations for visual recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[36]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[39]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[40]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Bolei Zhou,et al.  Understanding Intra-Class Knowledge Inside CNN , 2015, ArXiv.

[43]  Thomas Brox,et al.  Inverting Visual Representations with Convolutional Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Stéphane Mallat,et al.  Understanding deep convolutional networks , 2016, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[45]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Ahmed M. Elgammal,et al.  Digging Deep into the Layers of CNNs: In Search of How CNNs Achieve View Invariance , 2015, ICLR.

[47]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[49]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[50]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[53]  Allan Jabri,et al.  Learning Visual Features from Large Weakly Supervised Data , 2015, ECCV.

[54]  Gregory Shakhnarovich,et al.  Learning Representations for Automatic Colorization , 2016, ECCV.

[55]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Trevor Darrell,et al.  Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.

[57]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Derek Hoiem,et al.  Learning without Forgetting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.