Capturing Text Semantics for Concept Detection in News Video

The overwhelming amount of multimedia content has triggered the need for automatic semantic concept detection. However, because of the large variations in the visual feature space, text from automatic speech recognition (ASR) has been used extensively and found to complement visual features effectively in the concept detection task. There are two common text analysis methods, text classification and text retrieval, and each has its own strengths and weaknesses. In addition, the fusion of text and visual analysis is still an open problem. In this paper, we present a novel multiresolution, multisource and multimodal (M3) transductive learning framework. We fuse text and visual features via a multiresolution model, because features from different modalities work well only at different temporal resolutions, which exhibit different types of semantics. We perform a multiresolution analysis at the shot, multimedia discourse, and story levels to capture the semantics in a news video. While visual features play a dominant role at the shot level, text plays an increasingly important role as we move from the multimedia discourse level towards the story level. Our multisource transductive inference model provides a way to combine the text classification and text retrieval methods. We test our M3 transductive model for semantic concept detection on the TRECVID 2004 dataset. Preliminary results demonstrate that our approach is effective.
