Using semantic context for multiple concepts detection in still images

Multimedia documents indexing systems performances have been improved significantly in recent years, especially after the involvement of deep learning approaches. However, this progress remains insufficient with the evolution of users' needs that become complex in terms of semantics and the number of words that compose their queries. So, it is important to think about indexing images by a group of concepts simultaneously (multi-concepts) and not just single ones. This would allow systems to better respond to queries composed of several terms. This task is much more difficult than indexing images by single concepts. Multi-concepts detection in images has been little dealt in the state of the art compared to the detection of visual single concepts. On the other hand, the use of context has proved its effectiveness in the field of multimedia semantic indexing. In this work, we propose two approaches that consider the semantic context for multi-concepts detection in still images. We tested and evaluated our proposal on the international standard corpus Pascal VOC for the detection of concepts pairs and triplets of concepts. Our contributions have shown that context is useful and improves multi-concepts detection in images. The combination of the use of semantic context and deep learning-based features yielded much better results than those of the state of the art. This difference in performance is estimated by a relative gain on mean average precision reaching + 70% for concepts pairs and + 34% for the case of triplets of concepts.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Guojun Lu,et al.  Evaluation of MPEG-7 shape descriptors against other shape descriptors , 2003, Multimedia Systems.

[3]  Patrick Brézillon,et al.  Context in problem solving: a survey , 1999, The Knowledge Engineering Review.

[4]  Marcel Worring,et al.  Adding Semantics to Detectors for Video Retrieval , 2007, IEEE Transactions on Multimedia.

[5]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[7]  Hervé Glotin,et al.  IRIM at TRECVID2009: High Level Feature Extraction , 2009 .

[8]  Jing Huang,et al.  Image indexing using color correlograms , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[9]  John R. Smith,et al.  Multimedia semantic indexing using model vectors , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[10]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Xian-Sheng Hua,et al.  Two-Dimensional Active Learning for image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Thomas S. Huang,et al.  Factor graph framework for semantic video indexing , 2002, IEEE Trans. Circuits Syst. Video Technol..

[13]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Georges Quénot,et al.  Extended conceptual feedback for semantic multimedia indexing , 2014, Multimedia Tools and Applications.

[15]  Heng-Da Cheng,et al.  Effective image retrieval using dominant color descriptor and fuzzy support vector machine , 2009, Pattern Recognit..

[16]  Gang Wang,et al.  Joint learning of visual attributes, object classes and visual saliency , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  B. S. Manjunath,et al.  Color and texture descriptors , 2001, IEEE Trans. Circuits Syst. Video Technol..

[18]  Dong Wang,et al.  Video search in concept subspace: a text-like paradigm , 2007, CIVR '07.

[19]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[20]  David Dagan Feng,et al.  Improving News Video Annotation with Semantic Context , 2010, 2010 International Conference on Digital Image Computing: Techniques and Applications.

[21]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[22]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[23]  Georges Quénot,et al.  Quaero at TRECVID 2013: Semantic Indexing and Instance Search , 2013 .

[24]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[25]  Bill N. Schilit,et al.  Context-aware computing applications , 1994, Workshop on Mobile Computing Systems and Applications.

[26]  Thomas M. Strat,et al.  Employing Contextual Information in Computer Vision , 1993 .

[27]  Georges Quénot,et al.  Re-ranking for Multimedia Indexing and Retrieval , 2011, ECIR.

[28]  Bernard. Merialdo,et al.  Eurecom at TRECVID 2009 High-Level Feature Extraction , 2009, TRECVID.

[29]  Georges Quénot,et al.  Descriptor optimization for multimedia indexing and retrieval , 2013, Multimedia Tools and Applications.

[30]  Xian-Sheng Hua,et al.  Image Classification With Kernelized Spatial-Context , 2010, IEEE Transactions on Multimedia.

[31]  B. S. Manjunath,et al.  Texture features and learning similarity , 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[33]  Georges Quénot,et al.  A comparative study for multiple visual concepts detection in images and videos , 2015, Multimedia Tools and Applications.

[34]  Lilly Suriani Affendey,et al.  Developing context model supporting spatial relations for semantic video retrieval , 2010, 2010 International Conference on Information Retrieval & Knowledge Management (CAMP).

[35]  Georges Quénot,et al.  Re-ranking by local re-scoring for video indexing and retrieval , 2011, CIKM '11.

[36]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[37]  Dong Xu,et al.  Columbia University TRECVID-2006 Video Search and High-Level Feature Extraction , 2006, TRECVID.

[38]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.

[39]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Tommy W. S. Chow,et al.  Object-Level Video Advertising: An Optimization Framework , 2017, IEEE Transactions on Industrial Informatics.

[41]  Koen E. A. van de Sande,et al.  A comparison of color features for visual concept classification , 2008, CIVR '08.

[42]  Marcel Worring,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Harvesting Social Images for Bi-Concept Search , 2022 .

[43]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[44]  Jorma Laaksonen,et al.  PicSOM Experiments in TRECVID 2018 , 2015, TRECVID.

[45]  B. S. Manjunath,et al.  A texture descriptor for browsing and similarity retrieval , 2000, Signal Process. Image Commun..

[46]  Yi Wu,et al.  Ontology-based multi-classification learning for video concept detection , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[47]  Georges Quénot,et al.  Two-layers re-ranking approach based on contextual information for visual concepts detection in videos , 2012, 2012 10th International Workshop on Content-Based Multimedia Indexing (CBMI).

[48]  Sharath Pankanti,et al.  IBM Research and Columbia University TRECVID-2013 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), Surveillance Event Detection (SED), and Semantic Indexing (SIN) Systems , 2013, TRECVID.

[49]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[50]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[52]  Tommy W. S. Chow,et al.  Tree2Vector: Learning a Vectorial Representation for Tree-Structured Data , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[53]  Sunitha Abburu Context Ontology Construction For Cricket Video , 2010 .

[54]  Serge J. Belongie,et al.  Context based object categorization: A critical survey , 2010, Comput. Vis. Image Underst..

[55]  Wei-Hao Lin,et al.  Confounded Expectations: Informedia at TRECVID 2004 , 2004, TRECVID.

[56]  Ramin Zabih,et al.  Comparing images using color coherence vectors , 1997, MULTIMEDIA '96.

[57]  Chong-Wah Ngo,et al.  Concept-Driven Multi-Modality Fusion for Video Search , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[58]  Denis Pellerin,et al.  Learned features versus engineered features for multimedia indexing , 2017, Multimedia Tools and Applications.

[59]  Rong Yan,et al.  The combination limit in multimedia retrieval , 2003, MULTIMEDIA '03.

[60]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[61]  Georges Quénot,et al.  Evaluations of multi-learner approaches for concept indexing in video documents , 2010, RIAO.

[62]  Lior Wolf,et al.  A Critical View of Context , 2006, International Journal of Computer Vision.

[63]  Djoerd Hiemstra,et al.  A probabilistic ranking framework using unobservable binary events for video search , 2008, CIVR '08.