Tracking topics in broadcast news data

This paper describes a topic tracking system and its ability to cope with sparse training data in broadcast news tracking. The baseline tracker relies on a unigram topic model. To compensate for the very small amount of training data available for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after each test story is processed. A new technique, variable-weight unsupervised online adaptation, has been developed and was found to outperform traditional fixed-weight online adaptation. Combining document expansion and adaptation resulted in a 37% cost reduction when tested on both English and machine-translated Mandarin broadcast news data transcribed by an ASR system, with manual story boundaries. Another challenging condition is one in which the story boundaries of the broadcast news data are not known. A window-based automatic story boundary detector has been developed for the tracking system. On the TDT3 corpus, the tracking results obtained with the window-based system are comparable to those obtained with state-of-the-art automatic story segmentation.
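
To make the adaptation idea concrete, the following is a minimal sketch of unigram topic tracking with unsupervised online adaptation. It is an illustration under stated assumptions, not the system described above: the additive smoothing, the length-normalized log-likelihood-ratio score, the decision threshold, and in particular the mapping from score margin to adaptation weight are all hypothetical choices, as are the function and parameter names. Document expansion is not covered.

```python
import math
from collections import Counter


def unigram(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def score(tokens, topic, background):
    """Length-normalized log-likelihood ratio of topic vs. background model."""
    words = [t for t in tokens if t in topic]
    if not words:
        return 0.0
    return sum(math.log(topic[w] / background[w]) for w in words) / len(words)


def adapt(topic, story_model, weight):
    """Interpolate the topic model toward the story model."""
    return {w: (1 - weight) * topic[w] + weight * story_model[w] for w in topic}


def track(train_tokens, stories, vocab, background,
          threshold=0.0, max_weight=0.5):
    """Score each story in order; adapt the topic model on accepted stories."""
    topic = unigram(train_tokens, vocab)
    decisions = []
    for tokens in stories:
        s = score(tokens, topic, background)
        on_topic = s > threshold
        decisions.append(on_topic)
        if on_topic:
            # Variable weight (assumed form): the further the score lies above
            # the threshold, the more the story is trusted, up to max_weight.
            # A fixed-weight scheme would use a constant here instead.
            weight = max_weight * (1 - math.exp(-(s - threshold)))
            topic = adapt(topic, unigram(tokens, vocab), weight)
    return decisions, topic


if __name__ == "__main__":
    vocab = {"quake", "rescue", "market", "stocks"}
    background = unigram([], vocab)  # uniform background for the toy example
    stories = [["quake", "rescue", "rescue"], ["market", "stocks"]]
    decisions, _ = track(["quake", "rescue", "quake"], stories, vocab, background)
    print(decisions)
```

Because no labels are available at test time, the tracker's own decision gates the update; tying the interpolation weight to the score margin is one way to limit the damage from adapting on a story that was accepted with low confidence.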