A comparative study for multiple visual concepts detection in images and videos

Automatic indexing of images and videos is a highly relevant research area in multimedia information retrieval, and the difficulty of this task is well established. In the past, most research efforts have focused on the detection of single concepts in images and videos, which is already a hard task. As information retrieval systems evolve, users' needs become more abstract and lead to queries composed of a larger number of words. It is therefore important to index multimedia documents with more than just individual concepts, so that retrieval systems can answer such complex queries. Few studies have specifically addressed the problem of detecting multiple concepts (multi-concepts) in images and videos, and most of them concern the detection of concept pairs. These studies showed that this challenge is even greater than that of single-concept detection. In this work, we address the problem of multi-concept detection in images and videos through a detailed comparative study. Three types of approaches are considered: 1) building dedicated multi-concept detectors, 2) fusing single-concept detectors, and 3) exploiting the detectors of a set of single concepts in a stacking scheme. We conducted our evaluations on the PASCAL VOC'12 collection for the detection of concept pairs and triplets, and extended the evaluation to the TRECVid 2013 dataset for the detection of infrequent concept pairs. Our results show that the three types of approaches give globally comparable results for images, but differ for specific kinds of pairs and triplets. For videos, late fusion of detectors appears to be more effective and efficient when the single-concept detectors perform well. Otherwise, directly building bi-concept detectors remains the best alternative, especially if a well-annotated dataset is available. The third approach did not bring any additional gain in accuracy or efficiency.
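The second family of approaches, late fusion of single-concept detectors, can be illustrated with a minimal sketch. Assuming each single-concept detector outputs a per-image probability score, a pair score can be obtained by combining the two score vectors; the `product`, `min`, and `mean` rules below are common fusion operators, not necessarily the exact ones used in this study.

```python
import numpy as np

def late_fusion_pair_scores(scores_a, scores_b, method="product"):
    """Combine the per-image scores of two single-concept detectors
    into a score for the concept pair (illustrative sketch).

    scores_a, scores_b: iterables of detector probabilities in [0, 1],
    one entry per image. Returns a NumPy array of pair scores.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if method == "product":
        # High only when both detectors fire (probabilistic AND
        # under an independence assumption).
        return a * b
    if method == "min":
        # Pair score limited by the weaker detector.
        return np.minimum(a, b)
    if method == "mean":
        # Simple average of the two detector outputs.
        return (a + b) / 2.0
    raise ValueError(f"unknown fusion method: {method}")
```

For example, with detector scores `[0.8, 0.2]` for "person" and `[0.5, 0.9]` for "car", the product rule gives pair scores `[0.4, 0.18]`, ranking the first image higher for the pair. As the abstract notes, such fusion works best when the individual detectors are themselves reliable.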
