Ensemble Learning with LDA Topic Models for Visual Concept Detection

With the rapid growth of multimedia applications and network technologies, especially the proliferation of Web 2.0 and digital cameras, there has been an explosion of images and videos on the Internet. For example, by May 2011 the volume of video uploaded to YouTube every minute had reached 48 hours, having doubled in the previous two years. Such huge video collections hold useful yet implicit and nontrivial knowledge about various domains. To manage and utilize these resources effectively, video concept detection has become an important subject of intensive research by a large community (Over et al., 2008). It is an integral part of visual data mining, which automatically extracts such knowledge from huge volumes of unstructured visual data, and it aims to automatically annotate video shots or keyframes with respect to a semantic concept (Tang et al., 2012). Ranging from objects like airplane and car to scenes like urban street and sky, semantic concepts serve as good intermediate semantic features for video content indexing and understanding, and have thus attracted much research attention (Jiang et al., 2010; Naphade & Smith, 2004; Snoek et al., 2006; Zheng et al., 2008). Essentially, concept detection is a classification task in which a binary classifier is usually learned to predict the presence of a certain concept in a video shot or keyframe (image). Traditional concept detection methods mainly rely on global classification: a supervised machine learning technique, such as a single Support Vector Machine (SVM), is trained over the whole training dataset.
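
The global classification setting described above can be illustrated with a minimal sketch: one linear SVM is fitted on the entire training set and then applied to each keyframe. Everything in it is an assumption made for illustration and is not taken from the chapter: the function names `train_global_svm` and `detect`, the two-dimensional toy keyframe features, the regularization value, and the use of plain batch sub-gradient descent on the hinge loss without a bias term, where a real system would normally use an SVM library with a kernel.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_global_svm(features, labels, lam=0.01, iterations=200):
    """Fit ONE linear SVM on the whole training set (global classification).

    Minimizes lam/2 * ||w||^2 + mean hinge loss by batch sub-gradient
    descent with step size 1 / (lam * t). No bias term is used
    (assumption: the toy features are separable through the origin).
    """
    n = len(features)
    w = [0.0] * len(features[0])
    for t in range(1, iterations + 1):
        eta = 1.0 / (lam * t)
        # Accumulate the hinge-loss sub-gradient over margin violators.
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            if y * dot(w, x) < 1.0:
                for j, xj in enumerate(x):
                    grad[j] += y * xj
        # Shrink w (regularizer) and step toward the violating samples.
        w = [(1.0 - eta * lam) * wj + (eta / n) * gj
             for wj, gj in zip(w, grad)]
    return w


def detect(w, x):
    """Return +1 if the concept is predicted present in keyframe x, else -1."""
    return 1 if dot(w, x) > 0 else -1


# Hypothetical keyframe descriptors: +1 = concept present, -1 = absent.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
           [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
train_y = [1, 1, 1, -1, -1, -1]

w = train_global_svm(train_x, train_y)
print(detect(w, [0.85, 0.15]), detect(w, [0.25, 0.75]))
```

In this sketch a single weight vector `w` has to account for every training sample at once, which is the defining property of the global approach; real keyframe descriptors have far more dimensions and are rarely linearly separable.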

[1]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[2]  Adrian Ulges,et al.  Adapting Web-based Video Concept Detectors for Different Target Domains , 2013 .

[3]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[4]  Laurent Itti,et al.  Rapid Biologically-Inspired Scene Classification Using Features Shared with Visual Attention , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Stéphane Ayache,et al.  Video Corpus Annotation Using Active Learning , 2008, ECIR.

[6]  John R. Smith,et al.  On the detection of semantic concepts at TRECVID , 2004, MULTIMEDIA '04.

[7]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[8]  Yves Lecourtier,et al.  Controlling the diversity in classifier ensembles through a measure of agreement , 2005, Pattern Recognit..

[9]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[10]  Marcel Worring,et al.  Concept-Based Video Retrieval , 2009, Found. Trends Inf. Retr..

[11]  Yugang Jiang.  Large scale semantic concept detection, fusion, and selection for domain adaptive video search , 2009 .

[12]  Sheng Tang,et al.  Sparse Ensemble Learning for Concept Detection , 2012, IEEE Transactions on Multimedia.

[13]  Zhouyu Fu,et al.  Recognition of Pornographic Web Pages by Classifying Texts and Images , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Mubarak Shah,et al.  Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching , 2010, TRECVID.

[15]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Chong-Wah Ngo,et al.  VIREO/DVMM at TRECVID 2009: High-Level Feature Extraction, Automatic Video Search, and Content-Based Copy Detection , 2009, TRECVID.

[17]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[18]  Dariu Gavrila,et al.  An Experimental Study on Pedestrian Classification , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Robert Tibshirani,et al.  An Introduction to the Bootstrap , 1994 .

[20]  S. Thorpe,et al.  Speed of processing in the human visual system , 1996, Nature.

[21]  Bart Thomee,et al.  New trends and ideas in visual concept detection: the MIR flickr retrieval evaluation initiative , 2010, MIR '10.

[22]  Stéphane Ayache,et al.  Classifier Fusion for SVM-Based Multimedia Semantic Indexing , 2007, ECIR.

[23]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[24]  Frédéric Jurie,et al.  Creating efficient codebooks for visual recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[25]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  Jason Weston,et al.  Fast Kernel Classifiers with Online and Active Learning , 2005, J. Mach. Learn. Res..

[27]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[28]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, CVPR Workshops.

[29]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[30]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[31]  Xiaogang Wang,et al.  Action Recognition Using Topic Models , 2011, Visual Analysis of Humans.

[32]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[33]  Dariu Gavrila,et al.  Monocular Pedestrian Detection: Survey and Experiments , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Rong Yan,et al.  How many high-level concepts will fill the semantic gap in news video retrieval? , 2007, CIVR '07.

[35]  Dennis Koelma,et al.  The MediaMill TRECVID 2008 Semantic Video Search Engine , 2008, TRECVID.

[36]  Daniel Barbará,et al.  Embedding Semantics in LDA Topic Models , 2010 .

[37]  Shih-Fu Chang,et al.  Visually Searching the Web for Content , 1997, IEEE Multim..

[38]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[39]  Sheng Tang,et al.  Pornprobe: an LDA-SVM based pornography detection system , 2009, ACM Multimedia.

[40]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[41]  David A. Forsyth,et al.  Automatic Detection of Human Nudes , 1999, International Journal of Computer Vision.

[42]  Adrian Ulges,et al.  Keyframe Extraction for Video Tagging & Summarization , 2008, Informatiktage.

[43]  Ramesh C. Jain,et al.  Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images , 2011, TIST.

[44]  Sanjeev Khudanpur,et al.  TRECVID 2005 Experiment at Johns Hopkins University: Using Hidden Markov Models for Video Retrieval , 2005, TRECVID.

[45]  Sheng Tang,et al.  Localized Multiple Kernel Learning for Realistic Human Action Recognition in Videos , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[46]  Anil K. Jain,et al.  On image classification: city images vs. landscapes , 1998, Pattern Recognit..

[47]  Li Fei-Fei,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2008 .

[48]  Yang Liu,et al.  Image Guarder: An Intelligent Detector for Adult Images , 2003 .

[49]  Chong-Wah Ngo,et al.  Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search , 2008, TRECVID.

[50]  Shih-Fu Chang,et al.  Cross-domain learning methods for high-level visual concept classification , 2008, 2008 15th IEEE International Conference on Image Processing.

[51]  Yihong Gong,et al.  Automatic parsing and indexing of news video , 1995, Multimedia Systems.

[52]  Yuchou Chang,et al.  Unsupervised Video Shot Detection Using Clustering Ensemble with a Color Global Scale-Invariant Feature Transform Descriptor , 2008, EURASIP J. Image Video Process..

[53]  Meng Wang,et al.  Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation , 2009, IEEE Transactions on Multimedia.

[54]  John R. Smith,et al.  IBM Research TRECVID-2009 Video Retrieval System , 2009, TRECVID.

[55]  Tat-Seng Chua,et al.  Image Annotation by Graph-Based Inference With Integrated Multiple/Single Instance Representations , 2010, IEEE Transactions on Multimedia.

[56]  Markus A. Stricker,et al.  Similarity of color images , 1995, Electronic Imaging.

[57]  Martin Szummer,et al.  Indoor-outdoor image classification , 1998, Proceedings 1998 IEEE International Workshop on Content-Based Access of Image and Video Database.

[58]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[59]  Sheng Tang,et al.  MovieBase: a movie database for event detection and behavioral analysis , 2009, WSMC '09.

[60]  Jitendra Malik,et al.  SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[61]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[62]  Shumeet Baluja,et al.  Large scale image-based adult-content filtering , 2006, VISAPP.

[63]  Paul Over,et al.  Video shot boundary detection: Seven years of TRECVid activity , 2010, Comput. Vis. Image Underst..

[64]  Subhash C. Bagui,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2005, Technometrics.

[65]  Sheng Tang,et al.  Multi-Modality Transfer Based on Multi-Graph Optimization for Domain Adaptive Video Concept Annotation , 2010, 2010 Fourth Pacific-Rim Symposium on Image and Video Technology.

[66]  David W. Opitz,et al.  Actively Searching for an Effective Neural-Network Ensemble , 1996 .

[67]  David W. Opitz,et al.  Generating Accurate and Diverse Members of a Neural-Network Ensemble , 1995, NIPS.

[68]  Meng Wang,et al.  Unified Video Annotation via Multigraph Learning , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[69]  Shuicheng Yan,et al.  Near-duplicate keyframe retrieval by nonrigid image matching , 2008, ACM Multimedia.

[70]  Sheng Tang,et al.  TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS , 2007, TRECVID.

[71]  Thomas G. Dietterich.  Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[72]  Matthijs C. Dorst.  Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[73]  Lior Rokach,et al.  Pattern Classification Using Ensemble Methods , 2009, Series in Machine Perception and Artificial Intelligence.

[74]  Yung-Yu Chuang,et al.  Multi-cue fusion for semantic video indexing , 2008, ACM Multimedia.

[75]  Prasenjit Mitra,et al.  Event detection with spatial latent Dirichlet allocation , 2011, JCDL '11.

[76]  Anders Krogh,et al.  Neural Network Ensembles, Cross Validation, and Active Learning , 1994, NIPS.

[77]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[78]  Cor J. Veenman,et al.  The influence of cross-validation on video classification performance , 2006, MM '06.

[79]  Yang Jing,et al.  A pseudo relevance feedback based cross domain video concept detection , 2011, ICIMCS 2011.

[80]  W. Eric L. Grimson,et al.  Spatial Latent Dirichlet Allocation , 2007, NIPS.

[81]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[82]  Paul Over,et al.  TRECVID 2008 - Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2010, TRECVID.

[83]  Paul Over,et al.  High-level feature detection from video in TRECVid: a 5-year retrospective of achievements , 2009 .

[84]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[85]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[86]  Chong-Wah Ngo,et al.  Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study , 2010, IEEE Transactions on Multimedia.

[87]  Qi Tian,et al.  Probabilistic optimized ranking for multimedia semantic concept detection via RVM , 2008, CIVR '08.

[88]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[89]  Matti Pietikäinen,et al.  A comparative study of texture measures with classification based on featured distributions , 1996, Pattern Recognit..

[90]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[91]  Li Zhang,et al.  Sparse ensembles using weighted combination methods based on linear programming , 2011, Pattern Recognit..

[92]  Emine Yilmaz,et al.  Estimating average precision with incomplete and imperfect judgments , 2006, CIKM '06.

[93]  Bo Zhang,et al.  A Formal Study of Shot Boundary Detection , 2007, IEEE Transactions on Circuits and Systems for Video Technology.

[94]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.