Learned features versus engineered features for semantic video indexing

In this paper, we compare "traditional" engineered (hand-crafted) features (or descriptors) with learned features for content-based semantic indexing of video documents. Learned (or semantic) features are obtained by training classifiers for other target concepts on other data. These classifiers are then applied to the current collection, and the resulting vector of classification scores is used as a new feature for training a classifier for the current target concepts on the current collection. If the classifiers trained on the other collection are Deep Convolutional Neural Networks (DCNNs), the new feature can include not only the score values provided by the last layer but also the intermediate values output by any of the hidden layers. We carried out an extensive comparison of the performance of such features with that of traditional engineered features, as well as with combinations of the two. The comparison was made in the context of the TRECVid semantic indexing task. Our results confirm those obtained for still images: features learned from other training data generally outperform engineered features for concept recognition. Additionally, we found that directly training SVM classifiers on these features performs significantly better than partially retraining the DCNN to adapt it to the new data. We also found that, even though the learned features performed better than the engineered ones, the fusion of both performs significantly better still, indicating that engineered features remain useful, at least in this case.
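The pipeline described above (source classifiers applied to the current collection, their scores used as a feature vector, a new classifier trained on that vector, then fusion with a second system) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four source classifiers are hypothetical fixed linear scorers, the data are toy values, the SVM is a simple linear one trained by subgradient descent, and the fusion is a plain weighted average of scores. It is not the system evaluated in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical source-concept classifiers (weights, bias), assumed to have been
# trained beforehand on other concepts and other data.
SOURCE_CLASSIFIERS = [
    ([0.8, -0.2, 0.1], 0.0),
    ([-0.5, 0.9, 0.3], -0.1),
    ([0.1, 0.1, -0.7], 0.2),
    ([0.4, 0.4, 0.4], -0.3),
]

def semantic_feature(x):
    """Return the vector of source-classifier scores for one descriptor x."""
    return [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for w, b in SOURCE_CLASSIFIERS
    ]

def train_linear_svm(features, labels, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels in {-1, +1}."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            margin = y * (sum(w_i * f_i for w_i, f_i in zip(w, f)) + b)
            violated = margin < 1  # hinge loss is active for this sample
            for i in range(dim):
                grad = reg * w[i] - (y * f[i] if violated else 0.0)
                w[i] -= lr * grad
            if violated:
                b += lr * y
    return w, b

def svm_score(model, f):
    """Signed decision value of the linear model for feature vector f."""
    w, b = model
    return sum(w_i * f_i for w_i, f_i in zip(w, f)) + b

def late_fusion(score_a, score_b, weight_a=0.5):
    """Weighted average of two detection scores (one simple fusion scheme)."""
    return weight_a * score_a + (1.0 - weight_a) * score_b

# Toy "current collection": low-level descriptors with labels for one target concept.
descriptors = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]]
labels = [1, 1, -1, -1]

learned = [semantic_feature(x) for x in descriptors]
model_learned = train_linear_svm(learned, labels)         # classifier on learned features
model_engineered = train_linear_svm(descriptors, labels)  # classifier on raw descriptors

fused = [
    late_fusion(svm_score(model_learned, f), svm_score(model_engineered, x))
    for f, x in zip(learned, descriptors)
]
```

In this sketch the learned feature has one dimension per source concept, so its length depends on how many source classifiers are available rather than on the size of the low-level descriptor. With a DCNN as the source model, the same role could be played by the activations of a hidden layer instead of the final scores.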
