Topic Labeling of Multilingual Broadcast News in the Informedia Digital Video Library

The Informedia Digital Video Library Project includes a multilingual component for retrieval of video documents in multiple languages and a topic-labeling component for English video documents. We now extend this capability to English topic labeling of foreign-language broadcast-news stories. News stories are coarsely machine-translated into English, then assigned to a topic category using a K-nearest-neighbor algorithm. In preliminary tests on Croatian television news, topic assignment based on the best available machine translation technology performed only 8% worse (on a standard F-measure of performance) than assignment based on manual document translation. With a phrase-based MT module, the performance degradation was 31%.

1 The Informedia Digital Video Library

The Informedia Digital Library Project [1,2] allows full content indexing and retrieval of text, audio and video material, similar to what is available today for text only. To enable this access to video, speech recognition provides a text transcript of the audio track, while image processing determines scene boundaries, recognizes faces, and supports image similarity comparisons. Everything is indexed into a searchable digital video library [4,6], where users can submit queries and retrieve relevant news stories as results. News-on-Demand is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. As of July 1998, the Informedia project had about 1.3 terabytes of news video indexed and accessible online, with 1,200 news broadcasts containing 24,000 news stories. The Informedia digital video library system has two distinct subsystems: the Library Creation System and the Library Exploration Client. The library creation system runs every night, automatically capturing, processing and adding current news shows to the library.
During the library creation phase, topics are automatically assigned to incoming news stories. During library exploration, the user can browse or search these stories and topics using the library exploration client. In [17], we described and evaluated a topic-labeling component for the English-language version of the Informedia Digital Video Library. At 5 topics, the KNN-based system’s recall was 0.49 and its relevance was 0.48, with an F-measure at equal recall and precision of about 0.48.
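
As a rough illustration of the K-nearest-neighbor topic assignment and the F-measure mentioned above, the following Python sketch labels a (possibly coarsely translated) story with the topics of its most similar labeled stories. It is not the Informedia implementation: the bag-of-words term-frequency vectors, the cosine similarity, the similarity-weighted voting, and the names `tf_vector`, `cosine`, `knn_topics` and `f_measure` are all assumptions made for the example. The F-measure shown is the balanced harmonic mean, with the reported relevance figure treated as precision, which is also an assumption.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_topics(story, labeled, k=3, n_topics=5):
    """Return up to n_topics topics voted for by the k most similar stories.

    `labeled` is a list of (text, [topics]) pairs; each neighbor votes for
    its topics with a weight equal to its similarity (assumed scheme).
    """
    query = tf_vector(story)
    scored = sorted(
        ((cosine(query, tf_vector(text)), topics) for text, topics in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, topics in scored[:k]:
        if sim <= 0.0:
            continue  # ignore neighbors that share no terms with the story
        for topic in topics:
            votes[topic] += sim
    ranked = sorted(votes, key=lambda t: (-votes[t], t))
    return ranked[:n_topics]


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy labeled collection, invented purely for illustration.
labeled = [
    ("the president signed a budget bill in congress", ["politics"]),
    ("the team won the championship game last night", ["sports"]),
    ("congress debated the tax bill", ["politics", "economy"]),
]
topics = knn_topics("parliament vote on budget bill", labeled, k=2)
score = f_measure(0.48, 0.49)
```

Under these assumptions, a story that shares only a few terms with the training stories, as might happen after coarse machine translation, can still receive a sensible label as long as its nearest neighbors agree on a topic.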