Learned features versus engineered features for multimedia indexing

In this paper, we compare “traditional” engineered (hand-crafted) features (or descriptors) with learned features for the content-based indexing of image and video documents. Learned (or semantic) features are obtained by training classifiers on a source collection whose samples are annotated with concepts. These classifiers are then applied to the samples of a destination collection, and the classification scores obtained for each sample are gathered into a vector that serves as its feature. These feature vectors are in turn used to train another classifier for the destination concepts on the destination collection. When the classifiers used on the source collection are Deep Convolutional Neural Networks (DCNNs), the intermediate values produced at the outputs of the hidden layers can also be used as feature vectors. We carried out an extensive comparison of the performance of such learned features with that of traditional engineered ones, as well as with combinations of the two, in the context of the TRECVid semantic indexing task. Our results confirm those obtained for still images: features learned from other training data generally outperform engineered features for concept recognition. We also found that directly training KNN and SVM classifiers on these features performs significantly better than partially retraining the DCNN to adapt it to the new data. Moreover, even though the learned features perform better than the engineered ones, fusing both performs better still, indicating that engineered features remain useful, at least in the considered case. Finally, the combination of DCNN features with KNN and SVM classifiers was applied to the VOC 2012 object classification task, where it currently obtains the best performance with a MAP of 85.4%.
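To make the pipeline concrete, the sketch below extracts a hidden-layer activation from a DCNN pretrained on a source collection and trains a linear SVM on it for a destination concept, then late-fuses the resulting score with a score from an engineered-feature classifier. This is a minimal illustration only, not the exact configuration used in the paper: the choice of VGG-16 and its fc7 layer, the torchvision/scikit-learn APIs, the SVM settings, the fusion weight, and the `train_paths`/`train_labels` variables are all assumptions made for the example. Training the SVM on these fixed features (rather than partially retraining the network) corresponds to the option that, in the reported experiments, performed better.

```python
# Minimal sketch (assumptions: torchvision VGG-16 fc7 features, linear SVM,
# weighted-average late fusion; not the exact pipeline of the paper).
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import SVC

# 1) DCNN trained on a source collection (here: ImageNet), truncated at fc7.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc7_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],   # fc6 -> ReLU -> Dropout -> fc7 -> ReLU
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def dcnn_feature(image_path: str) -> np.ndarray:
    """4096-d fc7 activation used as a learned feature for one keyframe."""
    with torch.no_grad():
        x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        return fc7_extractor(x).squeeze(0).numpy()

# 2) Train a destination-concept classifier on the learned features.
#    train_paths / train_labels are hypothetical destination-collection data.
X_train = np.stack([dcnn_feature(p) for p in train_paths])
svm = SVC(kernel="linear", probability=True).fit(X_train, train_labels)

# 3) Score a test keyframe and late-fuse with an engineered-feature score.
def fused_score(image_path: str, engineered_score: float, alpha: float = 0.7) -> float:
    dcnn_score = svm.predict_proba(dcnn_feature(image_path)[None, :])[0, 1]
    return alpha * dcnn_score + (1.0 - alpha) * engineered_score
```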
