Visual concepts for news story tracking: analyzing and exploiting the NIST TRECVID video annotation experiment

In the summer of 2003, over 100 researchers in video understanding used an interactive intelligent tool to annotate over 62 hours of news video from the NIST TRECVID database, spanning six months of 1998. These 47K shots with 433K labels from over 1000 visual concept categories comprise the largest publicly available ground truth for this domain. Our analysis of this data, combining the tools of statistical natural language processing, machine learning, and computer vision, finds significant novel statistical patterns that can be exploited to accurately track the episodes of a given news story over time using solely visual semantic labels. We find that the ground "truth" is very muddy, but by using the feature selection tool of information gain, we extract 14 reliable visual concepts with mid-frequency use; all but one are visual concepts that refer to settings, rather than actors, objects, or events. We discover that the probability that another episode of a named story recurs after a gap of d days is proportional to 1/(d + 1). We define a novel similarity measure between episodes i and j that incorporates both semantic and temporal properties: Dice(i, j)/(1 + gap(i, j)). We exploit a low-level computer vision technique, normalized cut (Laplacian eigenmaps), for clustering these episodes into stories, and in the process document a weakness of this popular technique. We use these empirical results to make specific recommendations on how to design better visual semantic ontologies for news stories and better video annotation tools.
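To make the semantic-temporal similarity measure and the normalized-cut clustering step concrete, the following is a minimal sketch. It assumes each episode is represented by a set of visual-concept labels plus a broadcast date; the Episode class, the function names, and the use of scikit-learn's SpectralClustering (which implements normalized-cut spectral clustering over a precomputed affinity matrix) are illustrative assumptions, not the authors' implementation.

    # Sketch: Dice(i, j) / (1 + gap(i, j)) similarity and normalized-cut
    # clustering of episodes into stories. Episode representation and all
    # names here are hypothetical; only the formulas come from the abstract.
    from dataclasses import dataclass
    from datetime import date
    import numpy as np
    from sklearn.cluster import SpectralClustering

    @dataclass(frozen=True)
    class Episode:
        concepts: frozenset   # visual-concept labels present in the episode
        aired: date           # broadcast date

    def dice(a: frozenset, b: frozenset) -> float:
        """Dice coefficient: 2|A intersect B| / (|A| + |B|)."""
        if not a and not b:
            return 0.0
        return 2.0 * len(a & b) / (len(a) + len(b))

    def similarity(i: Episode, j: Episode) -> float:
        """Semantic-temporal similarity: Dice(i, j) / (1 + gap(i, j)),
        with gap(i, j) measured in days between broadcasts."""
        gap = abs((i.aired - j.aired).days)
        return dice(i.concepts, j.concepts) / (1.0 + gap)

    def cluster_episodes(episodes, n_stories):
        """Group episodes into stories via normalized-cut spectral
        clustering on the precomputed pairwise affinity matrix."""
        affinity = np.array([[similarity(a, b) for b in episodes]
                             for a in episodes])
        model = SpectralClustering(n_clusters=n_stories,
                                   affinity="precomputed")
        return model.fit_predict(affinity)

Note how the 1/(1 + gap) factor mirrors the empirical 1/(d + 1) recurrence pattern: episodes separated by long gaps receive proportionally lower affinity, so the clustering prefers temporally compact stories unless the semantic (Dice) overlap is strong.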