Cross-modal categorisation of user-generated video sequences

This paper investigates the cross-modal classification of multimedia documents on social media platforms. Our framework predicts the user-chosen category of consumer-produced video sequences from their textual and visual features. The textual resources, which include metadata and automatic speech recognition transcripts, are represented as bags of words, and the video content is represented as a bag of clustered local visual features. We investigate the contribution of the individual modalities and how they should be combined when sequences lack certain resources. To this end, several classification methods are evaluated with varying resources. Our best approach achieves a mean average precision of 0.3977 using user-contributed metadata in combination with clustered SURF features.
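To make the described pipeline concrete, the following is a minimal sketch, not the paper's actual implementation: metadata text becomes a bag of words, SURF descriptors extracted from key frames are quantised against a k-means codebook into a bag of visual words, and the two feature blocks are concatenated (early fusion) to train a linear SVM. It assumes scikit-learn and an OpenCV build that ships the non-free SURF module; all function and variable names are illustrative.

import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def surf_descriptors(frame_paths):
    """Collect SURF descriptors from a video's key frames."""
    surf = cv2.xfeatures2d.SURF_create()  # requires an opencv-contrib non-free build
    descriptors = []
    for path in frame_paths:
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = surf.detectAndCompute(image, None)
        if desc is not None:
            descriptors.append(desc)
    # SURF's default descriptor is 64-dimensional
    return np.vstack(descriptors) if descriptors else np.empty((0, 64), dtype=np.float32)

def bag_of_visual_words(descriptors, codebook):
    """Quantise descriptors against a k-means codebook into a normalised histogram."""
    if len(descriptors) == 0:
        return np.zeros(codebook.n_clusters)
    words = codebook.predict(descriptors)
    histogram = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return histogram / histogram.sum()

def train(videos, vocab_size=500):
    """videos: list of (metadata_text, key_frame_paths, category) triples."""
    # Textual modality: bag of words over user-contributed metadata
    vectorizer = CountVectorizer()
    text_features = vectorizer.fit_transform([t for t, _, _ in videos]).toarray()

    # Visual modality: cluster all SURF descriptors into a visual vocabulary
    all_descriptors = np.vstack([surf_descriptors(f) for _, f, _ in videos])
    codebook = KMeans(n_clusters=vocab_size).fit(all_descriptors)
    visual_features = np.array(
        [bag_of_visual_words(surf_descriptors(f), codebook) for _, f, _ in videos])

    # Early fusion: concatenate both feature blocks and train a linear SVM
    features = np.hstack([text_features, visual_features])
    labels = [c for _, _, c in videos]
    classifier = LinearSVC().fit(features, labels)
    return vectorizer, codebook, classifier

At prediction time, a sequence that lacks a modality could simply contribute an all-zero block for the missing features; the combination strategies actually evaluated in the paper may differ.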
