Visual Concept Detection in Images and Videos

The rapid proliferation of digital images and videos makes content-based search in multimedia databases increasingly important. A prerequisite for effective image and video search is the automatic analysis and indexing of media content. Current approaches in image and video retrieval focus on semantic concepts that serve as an intermediate description to bridge the “semantic gap” between the data representation and the human interpretation. Due to the high complexity and variability in the appearance of visual concepts, the detection of arbitrary concepts is a very challenging task. In this thesis, the following aspects of visual concept detection systems are addressed:

First, enhanced local descriptors for mid-level feature coding are presented. Based on the observation that scale-invariant feature transform (SIFT) descriptors with different spatial extents yield large performance differences, a novel concept detection system is proposed that combines feature representations for different spatial extents using multiple kernel learning (MKL). A multi-modal video concept detection system is presented that relies on Bag-of-Words representations for visual and, in particular, for audio features. Furthermore, a method for the SIFT-based integration of color information, called color moment SIFT, is introduced. Comparative experimental results demonstrate the superior performance of the proposed systems on the MediaMill and VOC Challenges.

Second, an approach is presented that systematically utilizes the results of object detectors. Novel object-based features are generated from object detection results using different pooling strategies. For videos, detection results are assembled into object sequences, and a shot-based confidence score as well as further features, such as position, frame coverage, or movement, are computed for each object class. These features are used as additional input for the support vector machine (SVM)-based concept classifiers. Thus, other related concepts can also profit from object-based features. Extensive experiments on the MediaMill, VOC, and TRECVid Challenges show significant improvements in retrieval performance, not only for the object classes but in particular for a large number of indirectly related concepts. Moreover, the experiments demonstrate that a few object-based features are beneficial for a large number of concept classes. On the VOC Challenge, the additional use of object-based features led to a superior performance of 63.8% mean average precision for the image classification task.
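
The idea of turning object detection results into shot-level features can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names `pool_object_features` and `to_vector`, the input layout (per-frame lists of class name, confidence, and a normalized bounding box), and the choice of max pooling, average pooling, and mean box area as a stand-in for frame coverage are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def pool_object_features(frames, classes):
    """Pool per-frame detections of one shot into per-class features.

    frames:  list of frames; each frame is a list of
             (class_name, confidence, (x, y, w, h)) tuples, with the box
             given in coordinates normalized to [0, 1] (assumed layout).
    classes: list of object class names to describe.
    """
    per_class = defaultdict(list)  # class -> [(confidence, box area), ...]
    for detections in frames:
        # Keep only the most confident detection per class in this frame.
        best = {}
        for name, conf, (_x, _y, w, h) in detections:
            if name not in best or conf > best[name][0]:
                best[name] = (conf, w * h)
        # Frames without a detection of a class contribute zeros.
        for name in classes:
            per_class[name].append(best.get(name, (0.0, 0.0)))

    features = {}
    for name in classes:
        confs = [c for c, _ in per_class[name]]
        areas = [a for _, a in per_class[name]]
        features[name] = {
            "max_conf": max(confs, default=0.0),        # max pooling
            "avg_conf": mean(confs) if confs else 0.0,  # average pooling
            "coverage": mean(areas) if areas else 0.0,  # mean box area
        }
    return features


def to_vector(features, classes):
    """Flatten the per-class features into one vector for a classifier."""
    keys = ("max_conf", "avg_conf", "coverage")
    return [features[c][k] for c in classes for k in keys]


# Hypothetical two-frame shot with detections for two object classes.
shot = [
    [("car", 0.8, (0.0, 0.0, 0.5, 0.5)), ("car", 0.4, (0.0, 0.0, 1.0, 1.0))],
    [("person", 0.6, (0.1, 0.1, 0.2, 0.5))],
]
object_classes = ["car", "person"]
shot_features = pool_object_features(shot, object_classes)
shot_vector = to_vector(shot_features, object_classes)
```

In a setup like this, `shot_vector` would be appended to the other feature representations before training a concept classifier, so that concepts related to the detected objects can also draw on the pooled scores.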
