Unsupervised topic discovery applied to segmentation of news transcriptions

Audio transcriptions from Automatic Speech Recognition systems are a continuous stream of words that is difficult to read. Segmenting these transcriptions into thematically distinct stories and categorizing the stories by topic improves readability and comprehensibility. However, manually defined topic categories are rarely available, and the cost of annotating a large corpus with thousands of distinct topics is high. We describe a procedure for applying the Unsupervised Topic Discovery (UTD) algorithm to the Thematic Story Segmentation procedure in order to segment broadcast news episodes into stories and to assign automatic topic labels to these stories. We report results of using automatic topics for story segmentation on a collection of news episodes in English and Arabic. Our results indicate that story segmentation performance with automatic topic annotations from UTD is on par with performance with manual topic annotations.