An efficient concept detection system via sparse ensemble learning

Abstract: In this paper, we present an efficient concept detection system based on a novel bag-of-words extraction method and sparse ensemble learning. The system can efficiently build concept detectors on large-scale image datasets and achieve real-time concept detection on unseen images with state-of-the-art accuracy. To do so, we first develop an efficient bag of visual words (BoW) construction method based on sparse non-negative matrix factorization (NMF) and GPU-enabled SIFT feature extraction. We then develop a sparse ensemble learning method to build the detection model, which reduces learning time by an order of magnitude compared with traditional methods such as support vector machines (SVMs). To overcome the difficulty of manually annotating the training dataset, we construct a large training set using both pseudo-relevance feedback for negative samples and interactive feedback for positive samples. Experiments on the TRECVID 2012 and MIRFlickr-1M datasets demonstrate both the efficiency and the effectiveness of our system.
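The abstract does not spell out the sparse ensemble learning procedure, so the following is only a minimal sketch of the general idea under stated assumptions: train many cheap base models on small random subsets of the training data, then fit L1-regularized (Lasso) combination weights on held-out data so that only some base models are kept. The nearest-centroid base learner, the coordinate-descent Lasso solver, the synthetic two-class data, and every name below (`train_centroid_model`, `lasso_weights`, `lam`, and so on) are illustrative assumptions, not the authors' implementation.

```python
import random

def train_centroid_model(X, y):
    """Assumed base learner: a linear nearest-centroid classifier."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    dim = len(X[0])
    mu_p = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_n = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    # Bias places the decision boundary midway between the two centroids.
    b = -0.5 * sum((p + n) * wj for p, n, wj in zip(mu_p, mu_n, w))
    return w, b

def predict(model, x):
    """Return +1.0 or -1.0 for a single sample."""
    w, b = model
    return 1.0 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1.0

def soft_threshold(v, lam):
    """Shrink v towards zero by lam; values within [-lam, lam] become 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def lasso_weights(H, y, lam, n_iter=50):
    """Coordinate descent for min_a (1/2n)||y - H a||^2 + lam * ||a||_1."""
    n, m = len(H), len(H[0])
    a = [0.0] * m
    resid = list(y)  # equals y - H a, since a starts at zero
    for _ in range(n_iter):
        for j in range(m):
            col = [row[j] for row in H]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            rho = sum(c * (r + c * a[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - a[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                a[j] = new
    return a

def make_blobs(rng, n_per_class):
    """Synthetic 2-D data: positives first, then negatives."""
    X, y = [], []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_blobs(rng, 200)
X_val, y_val = make_blobs(rng, 100)
X_test, y_test = make_blobs(rng, 100)

# Step 1: train 20 base models, each on 10 positives and 10 negatives.
models = []
for _ in range(20):
    idx = rng.sample(range(0, 200), 10) + rng.sample(range(200, 400), 10)
    models.append(train_centroid_model([X_train[i] for i in idx],
                                       [y_train[i] for i in idx]))

# Step 2: fit L1-regularized combination weights on held-out predictions.
H_val = [[predict(m, x) for m in models] for x in X_val]
weights = lasso_weights(H_val, [float(t) for t in y_val], lam=0.1)

def ensemble_predict(x):
    """Weighted vote; base models with zero weight are never evaluated."""
    s = sum(w * predict(m, x) for w, m in zip(weights, models) if w != 0.0)
    return 1 if s > 0 else -1

accuracy = sum(ensemble_predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)
selected = sum(1 for w in weights if w != 0.0)
print(f"kept {selected} of {len(models)} base models, test accuracy {accuracy:.2f}")
```

In a sketch like this, each base model is cheap to train because it sees only a small subset, and any model whose weight is driven to zero is skipped at prediction time, which is one plausible reading of where the claimed training and detection speedups come from. The penalty `lam` is a hypothetical tuning knob: larger values tend to keep fewer base models.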
