Wikipedia Based News Video Topic Modeling for Information Extraction

Determining the topic of a news video story (NVS) from its audio-visual footage is an important part of metadata generation. In this paper we propose a topic modeling approach that draws on online knowledge resources such as Wikipedia to model the topic of a news story. An NVS is modeled as a distribution over several Wikipedia pages related to the story. The mapping of the NVS to the table of contents (TOC) of a Wikipedia page is also determined. The specific advantages of this topic modeling approach are: (1) The topic is interpretable as a weighted distribution over a set of semantically meaningful story title phrases rather than a mere collection of words. (2) It facilitates organizing news video stories into a taxonomy that captures several perspectives on the story. (3) The taxonomy facilitates exploration and non-linear search. Performance evaluations from an information extraction perspective, conducted on a large news video corpus, validate the efficacy of the proposed topic modeling approach compared to TF-IDF and LDA based approaches.
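
To make the idea of "a distribution over several Wikipedia pages" concrete, the following minimal sketch scores a story transcript against a handful of candidate pages and normalizes the scores so they sum to one. It is an illustration only and not the method evaluated in the paper: the cosine similarity on raw term counts is a stand-in scoring function, and the names `term_counts`, `cosine`, `topic_distribution`, the page titles, and the toy texts are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_distribution(transcript, pages):
    """Return weights over page titles that sum to 1 (all 0.0 if nothing matches).

    Assumption: similarity is plain term-count cosine, used here only as a
    placeholder for whatever relatedness measure a real system would use.
    """
    story = term_counts(transcript)
    scores = {title: cosine(story, term_counts(body)) for title, body in pages.items()}
    total = sum(scores.values())
    if total == 0:
        return {title: 0.0 for title in pages}
    return {title: score / total for title, score in scores.items()}


# Hypothetical toy data standing in for Wikipedia page text and a speech transcript.
pages = {
    "Earthquake": "an earthquake is the shaking of the ground caused by seismic waves",
    "Stock market": "a stock market is where shares of companies are bought and sold",
}
transcript = "rescue teams searched the rubble after the earthquake and the ground kept shaking"

weights = topic_distribution(transcript, pages)
for title, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{title}: {weight:.3f}")
```

The output is a weighted list of page titles, which mirrors the interpretability argument in advantage (1): the topic is read off as titled pages with weights instead of an unlabeled bag of words.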
