In the summer of 2003, over 100 video-understanding researchers used an interactive intelligent tool to annotate more than 62 hours of news video from the NIST TRECVID database, spanning six months of 1998. The resulting 47K shots, with 433K labels from over 1000 visual concept categories, comprise the largest publicly available ground truth for this domain. Our analysis of this data, combining the tools of statistical natural language processing, machine learning, and computer vision, finds significant novel statistical patterns that can be exploited to accurately track the episodes of a given news story over time, using semantic labels that are solely visual. We find that the ground "truth" is very muddy, but by using the feature selection tool of information gain, we extract 14 reliable visual concepts with mid-frequency use; all but one are visual concepts that refer to settings, rather than actors, objects, or events. We discover that the probability that another episode of a named story recurs after a gap of d days is proportional to 1/(d + 1). We define a novel similarity measure between episodes i and j that incorporates both semantic and temporal properties: Dice(i, j)/(1 + gap(i, j)). We exploit a low-level computer vision technique, normalized cut (Laplacian eigenmaps), for clustering these episodes into stories, and in the process document a weakness of this popular technique. We use these empirical results to make specific recommendations on how better visual semantic ontologies for news stories, and better video annotation tools, should be designed.
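As an illustration of the similarity measure Dice(i, j)/(1 + gap(i, j)), the following is a minimal sketch rather than the authors' implementation. It assumes each episode is represented by a set of visual concept labels and a broadcast date, that Dice is the standard set coefficient 2|A ∩ B| / (|A| + |B|), and that gap(i, j) is the absolute difference in days. The function names and the example labels are hypothetical.

```python
from datetime import date


def dice(labels_a, labels_b):
    """Standard Dice coefficient between two label sets: 2|A & B| / (|A| + |B|)."""
    if not labels_a and not labels_b:
        return 0.0  # assumption: two unlabeled episodes count as dissimilar
    return 2.0 * len(labels_a & labels_b) / (len(labels_a) + len(labels_b))


def episode_similarity(labels_i, day_i, labels_j, day_j):
    """Dice(i, j) / (1 + gap(i, j)), with gap assumed to be measured in days."""
    gap = abs((day_j - day_i).days)
    return dice(labels_i, labels_j) / (1 + gap)


# Hypothetical episodes: label sets and broadcast dates are made up for illustration.
labels_i = {"studio", "sky", "outdoors"}
labels_j = {"studio", "sky", "building"}
sim = episode_similarity(labels_i, date(1998, 3, 1), labels_j, date(1998, 3, 4))
```

Here the two episodes share two of their three labels, so Dice is 4/6, and the three-day gap divides that by 4, giving a similarity of about 0.167. The temporal denominator has the same 1/(d + 1) form as the recurrence pattern reported above, so episodes that are far apart in time are discounted in the same way.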