Local Features and a Two-Layer Stacking Architecture for Semantic Concept Detection in Video

This paper addresses the problem of extending and combining different local descriptors, and of exploiting concept correlations, to improve semantic concept detection in video. We examine how state-of-the-art binary local descriptors can facilitate concept detection, propose color extensions of them inspired by previously proposed color extensions of the scale-invariant feature transform (SIFT), and show that this color extension paradigm is generally applicable to both binary and nonbinary local descriptors. To use these color extensions in conjunction with a state-of-the-art feature encoding, we compact them using principal component analysis (PCA) and compare two alternatives for doing so. For the learning stage of concept detection, we perform a comparative study and propose an improved way of employing stacked models, which capture concept correlations, by using multilabel classification algorithms in the last layer of the stack. We evaluate and compare the above algorithms on the extensive video data set of the 2013 TRECVID Semantic Indexing Task, both for semantic video indexing within a large video collection and for the somewhat different problem of annotating individual videos with semantic concepts. From these experiments, we draw several conclusions on how to improve video semantic concept detection.
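To make the two-layer stacking idea concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes NumPy, synthetic features, and three hypothetical concepts, of which the third is correlated with the other two. The first layer trains one independent detector per concept; the second layer is trained on the vectors of first-layer scores. The second layer here is plain binary relevance with logistic regression, used only as a stand-in for whichever multilabel learner is placed in the last layer. All function names and hyperparameters are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    # Clip to avoid overflow in exp for extreme inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def add_bias(X):
    # Append a constant column so each detector learns an intercept.
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_logistic(X, y, lr=0.5, epochs=500, l2=1e-3):
    """Fit one binary logistic-regression detector by gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y) + l2 * w
        w -= lr * grad
    return w


def fit_layer(X, Y):
    """Train one independent detector per concept (one row of weights each)."""
    return np.vstack([fit_logistic(X, Y[:, k]) for k in range(Y.shape[1])])


def predict_scores(X, W):
    """Return an (n_samples, n_concepts) matrix of scores in [0, 1]."""
    return sigmoid(add_bias(X) @ W.T)


# Synthetic data (assumption): 8-dimensional features, 3 concepts,
# where concept 2 holds exactly when concepts 0 and 1 both hold.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
Y = np.zeros((600, 3))
Y[:, 0] = X[:, 0] > 0
Y[:, 1] = X[:, 1] > 0
Y[:, 2] = Y[:, 0] * Y[:, 1]

# Disjoint splits: layer-1 training, layer-2 training, and test.
X_a, Y_a = X[:200], Y[:200]
X_b, Y_b = X[200:400], Y[200:400]
X_test = X[400:]

# Layer 1: independent per-concept detectors on the raw features.
W1 = fit_layer(X_a, Y_a)

# Layer 2: trained on layer-1 score vectors computed for held-out data,
# so every concept's final score can draw on the scores of all concepts.
S_b = predict_scores(X_b, W1)
W2 = fit_layer(S_b, Y_b)

# Inference: pass test features through both layers.
S_test = predict_scores(X_test, W1)
refined = predict_scores(S_test, W2)
```

Training the second layer on a split that the first layer has not seen is a design choice made in this sketch to avoid feeding it over-optimistic first-layer scores; the abstract does not specify how the training data are partitioned.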
