IRIM at TRECVID 2014: Semantic Indexing and Instance Search

The IRIM group is a consortium of French teams supported by the GDR ISIS and working on multimedia indexing and retrieval. This paper describes its participation in the TRECVID 2014 semantic indexing (SIN) and instance search (INS) tasks.

For the semantic indexing task, our approach uses a six-stage processing pipeline that computes scores for the likelihood that a video shot contains a target concept. These scores are then used to produce a ranked list of the images or shots most likely to contain the target concept. The pipeline is composed of the following steps: descriptor extraction, descriptor optimization, classification, fusion of descriptor variants, higher-level fusion, and re-ranking. We evaluated a number of different descriptors and tried different fusion strategies. The best IRIM run has a Mean Inferred Average Precision of 0.2796, which ranked us 5th out of 15 participants.

For the INS 2014 task, IRIM followed the classical bag-of-words (BoW) approach, trained only on the EastEnders dataset. Shot signatures were computed either on one key frame or on several key frames (sampled at 1 fps) combined by average pooling. A dissimilarity that computes a distance only for the words present in the query was tested. A saliency map, built from the object ROI to incorporate background context, was tried. Late fusion of two individual BoW results, obtained with different detectors/descriptors (Hessian-Affine/SIFT and Harris-Laplace/Opponent SIFT), was used.
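As an illustration of the shot signature and the query-restricted dissimilarity described above, here is a minimal Python sketch. It is not the authors' implementation: the sparse-dictionary representation, the function names `average_pool` and `query_restricted_dissimilarity`, and the exact form of the distance (a sum of |q - s|^p over the query's words) are assumptions made for the example, since the text does not define L1p or L2p precisely.

```python
from typing import Dict, List

# Sparse BoW histogram: visual word id -> weight (hypothetical representation).
BoW = Dict[int, float]


def average_pool(frames: List[BoW]) -> BoW:
    """Average the sparse BoW histograms of all key frames of one shot."""
    if not frames:
        raise ValueError("a shot needs at least one key frame")
    totals: BoW = {}
    for hist in frames:
        for word, weight in hist.items():
            totals[word] = totals.get(word, 0.0) + weight
    n = len(frames)
    return {word: total / n for word, total in totals.items()}


def query_restricted_dissimilarity(query: BoW, shot: BoW, p: int = 1) -> float:
    """Sum |q_w - s_w|**p over the words w present in the query only.

    Words that occur in the shot but not in the query are ignored, so
    background content in the shot does not penalize it (assumed form).
    """
    return sum(
        abs(q - shot.get(word, 0.0)) ** p
        for word, q in query.items()
        if q != 0.0
    )


if __name__ == "__main__":
    key_frames = [{1: 2.0, 2: 4.0}, {2: 2.0, 3: 6.0}]
    signature = average_pool(key_frames)
    query = {2: 1.0, 5: 2.0}
    print(signature)
    print(query_restricted_dissimilarity(query, signature, p=1))
    print(query_restricted_dissimilarity(query, signature, p=2))
```

With a single key frame, `average_pool` simply returns that frame's histogram, so the same code covers both the one-key-frame and the several-key-frames settings.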
The four submitted runs were the following:
- Run F_D_IRIM_1 was the late fusion of BoW with SIFT, dissimilarity L2p, on several key frames per shot, with context for queries, and BoW with Opponent SIFT, dissimilarity L1p, on one key frame per shot.
- Run F_D_IRIM_2 was similar to F_D_IRIM_1, but context for queries was also used for the second BoW.
- Run F_D_IRIM_3 was similar to F_D_IRIM_1, but no context for queries was used.
- Run F_D_IRIM_4 was similar to F_D_IRIM_2, but used the delta1 dissimilarity [46] (from the best INS 2013 run).

We found that extracting several key frames per shot coupled with average pooling improved results. We confirmed that including context in queries was also beneficial. Surprisingly, our dissimilarity performed better than delta1.
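The runs above all combine two BoW result lists by late fusion, but the text does not state the fusion operator. The sketch below shows one common choice, a weighted sum of min-max normalized dissimilarities. It is an assumption for illustration only: the function names, the `weight_a` parameter, and the rule that a shot missing from one list receives the worst normalized value are all hypothetical.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale dissimilarities to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0.0:
        return {shot: 0.0 for shot in scores}
    return {shot: (value - lo) / span for shot, value in scores.items()}


def late_fusion(
    run_a: Dict[str, float],
    run_b: Dict[str, float],
    weight_a: float = 0.5,
) -> List[str]:
    """Fuse two dissimilarity lists and return shot ids, best match first.

    A shot absent from one list is given the worst normalized value (1.0)
    for that list. Ties are broken by shot id to keep the order stable.
    """
    a, b = min_max(run_a), min_max(run_b)
    shots = set(a) | set(b)
    fused = {
        shot: weight_a * a.get(shot, 1.0) + (1.0 - weight_a) * b.get(shot, 1.0)
        for shot in shots
    }
    return sorted(shots, key=lambda shot: (fused[shot], shot))


if __name__ == "__main__":
    sift_run = {"s1": 0.2, "s2": 0.8, "s3": 0.5}
    opponent_sift_run = {"s1": 1.0, "s2": 3.0, "s3": 2.0}
    print(late_fusion(sift_run, opponent_sift_run))
```

Normalizing before summing matters here because the two lists may come from different dissimilarities (for example L2p and L1p), whose raw values are not on the same scale.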

[1]  Patrick Lambert,et al.  Retina enhanced bag of words descriptors for video classification , 2014, 2014 22nd European Signal Processing Conference (EUSIPCO).

[2]  Georges Quénot,et al.  LIG at TRECVid 2014: Semantic Indexing , 2014, TRECVID.

[3]  Andrej Mikulík,et al.  Large-Scale Content-Based Sub-Image Search , 2014 .

[4]  Patrick Lambert,et al.  Bags of Trajectory Words for video indexing , 2014, 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI).

[5]  Shin'ichi Satoh,et al.  Multi-image aggregation for better visual object retrieval , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Sabin Tiberius Strat,et al.  Retina enhanced SURF descriptors for spatio-temporal concept detection , 2014, Multimedia Tools and Applications.

[7]  Shin'ichi Satoh,et al.  Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Georges Quénot,et al.  Conceptual feedback for semantic multimedia indexing , 2013, 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI).

[9]  Patrick Lambert,et al.  Retina enhanced SIFT descriptors for video indexing , 2013, 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI).

[10]  Georges Quénot,et al.  Descriptor optimization for multimedia indexing and retrieval , 2013, Multimedia Tools and Applications.

[11]  David Picard,et al.  Efficient image signatures and similarities using tensor products of local descriptors , 2013, Comput. Vis. Image Underst..

[12]  Jiri Matas,et al.  Learning Vocabularies over a Fine Quantization , 2013, International Journal of Computer Vision.

[13]  Chong-Wah Ngo,et al.  Searching visual instances with topology checking and context modeling , 2013, ICMR.

[14]  Georges Quénot,et al.  Hierarchical Late Fusion for Concept Detection in Videos , 2012, ECCV Workshops.

[15]  Gabriela Csurka,et al.  An empirical study of fusion operators for multimodal image retrieval , 2012, 2012 10th International Workshop on Content-Based Multimedia Indexing (CBMI).

[16]  Hervé Le Borgne,et al.  Locality-constrained and spatially regularized coding for scene categorization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Andrew Zisserman,et al.  Three things everyone should know to improve object retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[19]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[20]  Georges Quénot,et al.  Re-ranking by local re-scoring for video indexing and retrieval , 2011, CIKM '11.

[21]  Jiri Matas,et al.  Total recall II: Query expansion revisited , 2011, CVPR 2011.

[22]  Bernard Mérialdo,et al.  Saliency moments for image categorization , 2011, ICMR.

[23]  Koen E. A. van de Sande,et al.  Empowering Visual Categorization With the GPU , 2011, IEEE Transactions on Multimedia.

[24]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Hervé Glotin,et al.  Pyramidal Multi-level Features for the Robot Vision@ICPR 2010 Challenge , 2010, 2010 20th International Conference on Pattern Recognition.

[26]  Alice Caplier,et al.  Using Human Visual System modeling for bio-inspired low level image processing , 2010, Comput. Vis. Image Underst..

[27]  Georges Quénot,et al.  Evaluations of multi-learner approaches for concept indexing in video documents , 2010, RIAO.

[28]  C. Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[29]  Koen E. A. van de Sande,et al.  Evaluation of color descriptors for object and scene recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Matthieu Cord,et al.  Combining visual dictionary, kernel-based similarity and learning strategy for image category retrieval , 2008, Comput. Vis. Image Underst..

[31]  Stéphane Ayache,et al.  Video Corpus Annotation Using Active Learning , 2008, ECIR.

[32]  Jean-Loup Guillaume,et al.  Fast unfolding of community hierarchies in large networks , 2008, ArXiv.

[33]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[35]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[36]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[37]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[38]  K. Mikolajczyk,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[39]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[40]  Shu-Yuan Chen,et al.  Image classification using color, texture and regions , 2003, Image Vis. Comput..

[41]  Mario A. Nascimento,et al.  A compact and efficient image retrieval approach based on border/interior pixel classification , 2002, CIKM '02.

[42]  Jean-Luc Gauvain,et al.  The LIMSI Broadcast News transcription system , 2002, Speech Commun..

[43]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[44]  Nicolas Ballas,et al.  IRIM at TRECVID 2013: Semantic indexing and multimedia instance search , 2013 .

[45]  Georges Quénot,et al.  Quaero at TRECVID 2013: Semantic Indexing , 2013, TRECVID.

[46]  Chong-Wah Ngo,et al.  VIREO/ECNU @ TRECVID 2013: A Video Dance of Detection, Recounting and Search with Motion Relativity and Concept Learning from Wild , 2013, TRECVID.

[47]  Andrew Zisserman,et al.  Multiple queries for large scale specific object retrieval , 2012, BMVC.

[48]  Charles-Edmond Bichot,et al.  Color orthogonal local binary patterns combination for image region description (Technical Report) , 2011 .

[49]  Miriam Redi,et al.  EURECOM at TrecVid 2011: The Light Semantic Indexing Task , 2011, TRECVID.

[50]  Stéphane Ayache,et al.  IRIM at TRECVID 2010: High Level Feature Extraction and Instance Search , 2010 .

[51]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[52]  Christopher Hunt SURF: Speeded-Up Robust Features , 2009 .

[53]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, ECCV 2004.

[54]  Edward A. Fox,et al.  Combination of Multiple Searches , 1993, TREC.

[55]  Jorge Sánchez,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.