Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study

Based on the local keypoints extracted as salient image patches, an image can be described as a ¿bag-of-visual-words (BoW)¿ and this representation has appeared promising for object and scene classification. The performance of BoW features in semantic concept detection for large-scale multimedia databases is subject to various representation choices. In this paper, we conduct a comprehensive study on the representation choices of BoW, including vocabulary size, weighting scheme, stop word removal, feature selection, spatial information, and visual bi-gram. We offer practical insights in how to optimize the performance of BoW by choosing appropriate representation choices. For the weighting scheme, we elaborate a soft-weighting method to assess the significance of a visual word to an image. We experimentally show that the soft-weighting outperforms other popular weighting schemes such as TF-IDF with a large margin. Our extensive experiments on TRECVID data sets also indicate that BoW feature alone, with appropriate representation choices, already produces highly competitive concept detection performance. Based on our empirical findings, we further apply our method to detect a large set of 374 semantic concepts. The detectors, as well as the features and detection scores on several recent benchmark data sets, are released to the multimedia community.

[1]  Sebastian Nowozin,et al.  Weighted Substructure Mining for Image Analysis , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Yuan-Fang Wang,et al.  The use of bigrams to enhance text categorization , 2002, Inf. Process. Manag..

[3]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[4]  Sheng Tang,et al.  TRECVID 2008 Participation by MCG-ICT-CAS , 2008, TRECVID.

[5]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[6]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[7]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[8]  W. B. Cavnar,et al.  N-gram-based text categorization , 1994 .

[9]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  P. Bartlett,et al.  Probabilities for SV Machines , 2000 .

[11]  Shih-Fu Chang,et al.  Columbia University’s Baseline Detectors for 374 LSCOM Semantic Visual Concepts , 2007 .

[12]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[13]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Frédéric Jurie,et al.  Sampling Strategies for Bag-of-Features Image Classification , 2006, ECCV.

[15]  Djoerd Hiemstra,et al.  A probabilistic ranking framework using unobservable binary events for video search , 2008, CIVR '08.

[16]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[17]  Chong-Wah Ngo,et al.  Evaluating bag-of-visual-words representations in scene classification , 2007, MIR '07.

[18]  Cordelia Schmid,et al.  A maximum entropy framework for part-based texture and object recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[19]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[20]  John R. Smith,et al.  IBM Research TRECVID-2009 Video Retrieval System , 2009, TRECVID.

[21]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[22]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[23]  Patrick Haffner,et al.  Support vector machines for histogram-based image classification , 1999, IEEE Trans. Neural Networks.

[24]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[25]  Paul Over,et al.  TREC video retrieval evaluation TRECVID , 2008 .

[26]  Francesca Odone,et al.  Building kernels from binary strings for image matching , 2005, IEEE Transactions on Image Processing.

[27]  Chong-Wah Ngo,et al.  Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search , 2008, TRECVID.

[28]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[29]  John R. Smith,et al.  Large-scale concept ontology for multimedia , 2006, IEEE MultiMedia.

[30]  Chong-Wah Ngo,et al.  Fusing semantics, observability, reliability and diversity of concept detectors for video search , 2008, ACM Multimedia.

[31]  Chong-Wah Ngo,et al.  Video event detection using motion relativity and visual relatedness , 2008, ACM Multimedia.

[32]  Jitendra Malik,et al.  Geometric blur for template matching , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[33]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[34]  Koen E. A. van de Sande,et al.  A comparison of color features for visual concept classification , 2008, CIVR '08.

[35]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[36]  Alexander G. Hauptmann,et al.  LSCOM Lexicon Definitions and Annotations (Version 1.0) , 2006 .

[37]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[38]  Jitendra Malik,et al.  Detecting Categories in News Video Using Acoustic, Speech, and Image Features , 2006, TRECVID.

[39]  Rong Yan,et al.  Multi-Lingual Broadcast News Retrieval , 2006, TRECVID.

[40]  Dennis Koelma,et al.  The MediaMill TRECVID 2008 Semantic Video Search Engine , 2008, TRECVID.

[41]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Winston H. Hsu,et al.  Video Search and High-Level Feature Extraction , 2005 .

[43]  Cordelia Schmid,et al.  A Performance Evaluation of Local Descriptors , 2005, IEEE Trans. Pattern Anal. Mach. Intell..

[44]  Jitendra Malik,et al.  Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Cordelia Schmid,et al.  Coloring Local Feature Extraction , 2006, ECCV.

[46]  Jintao Li,et al.  Hierarchical spatio-temporal context modeling for action recognition , 2009, CVPR.