Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts

This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking (TDT) project. The research goal of the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segmentation of the news streams into individual stories, detection of new topics, identification of the first story to discuss any topic, tracking of all stories on selected topics, and detection of links among stories discussing the same topics. The corpora contain English and Chinese broadcast television and radio, newswires, and text from web sites devoted to news. For each source there are texts or text intermediaries; for the broadcast stories the audio is also available. Each broadcast is also segmented to show the start and end times of all news stories. LDC staff have defined news topics in the corpora and annotated each story to indicate its relevance to each topic. The end products are massive, richly annotated corpora available to support research and development in information retrieval, topic detection and tracking, information extraction, and message understanding, either directly or after additional annotation. This paper describes the corpora created for TDT, including sources, collection processes, formats, topic selection and definition, annotation, distribution, and project management for large corpora.