DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

We evaluate whether features extracted from the activations of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks, and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state of the art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, so that vision researchers can conduct experiments with deep representations across a range of visual concept learning paradigms.
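The recipe described above, treating a supervised ImageNet-trained network as a fixed feature extractor and learning only a simple classifier for the new task, can be sketched as follows. This is a minimal illustration, not the paper's released implementation: it assumes PyTorch and torchvision (a pretrained AlexNet-style network standing in for the DeCAF model), scikit-learn's LinearSVC as the downstream classifier, and hypothetical `image_paths`/`labels` for the target task.

```python
# Minimal sketch of the "fixed deep activation feature + simple classifier" recipe.
# Assumptions (not from the paper): torchvision's pretrained AlexNet plays the role
# of the original DeCAF network; image_paths and labels are hypothetical.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import LinearSVC

# Pretrained network, used only as a fixed feature extractor (no fine-tuning).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_feature(img_path):
    """Return the output of the first fully connected layer for one image."""
    x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    x = net.features(x)          # convolutional layers
    x = net.avgpool(x)
    x = torch.flatten(x, 1)
    x = net.classifier[1](x)     # first fully connected layer of AlexNet's classifier
    return x.squeeze(0).numpy()

# On the new task, only a simple linear classifier is trained on the fixed features;
# the deep network itself is neither retrained nor adapted.
# X = [deep_feature(p) for p in image_paths]
# clf = LinearSVC(C=1.0).fit(X, labels)
# predictions = clf.predict([deep_feature(p) for p in test_image_paths])
```

The activation taken here, the first fully connected layer, is roughly analogous to the paper's DeCAF6 feature; activations from the upper layers are the natural candidates for transfer because they sit above the convolutional stack but below the task-specific output layer.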
