Unsupervised broadcast news story segmentation using distance dependent Chinese restaurant processes

Traditional unsupervised approaches to broadcast news story segmentation require the number of segments to be set manually, yet this number is often unknown in real-world applications. In this paper, we solve this problem by modeling the generative process of stories as a mixture with a distance dependent Chinese restaurant process (dd-CRP) prior. We cut a news program into fixed-size text blocks and assume that the blocks belonging to the same story are generated from a story-specific topic. Specifically, the dd-CRP prior encodes the bias that a block is more likely to share its topic with nearby blocks than with distant ones. Story boundaries can then be found by detecting the points at which the topic changes. Experiments show that our approach outperforms both supervised and unsupervised approaches and that the number of segments can be learned automatically from the data.
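The locality bias and the boundary-detection step described above can be illustrated with a minimal sketch of a sequential dd-CRP prior. This is not the paper's implementation: the exponential decay function, the parameters `alpha` and `decay_scale`, and all function names are assumptions made for illustration, and the word likelihood and posterior inference are omitted entirely. Each block links either to itself (starting a new topic) or to an earlier block (joining that block's topic), and boundaries fall wherever adjacent blocks end up with different topic labels.

```python
import math
import random


def link_probabilities(i, alpha=1.0, decay_scale=2.0):
    """Prior over the link of block i (sequential dd-CRP, assumed decay).

    Entry j < i is the probability of linking to earlier block j, with
    weight exp(-(i - j) / decay_scale). The last entry (index i) is the
    self-link, with weight alpha.
    """
    weights = [math.exp(-(i - j) / decay_scale) for j in range(i)]
    weights.append(alpha)  # self-link starts a new topic
    total = sum(weights)
    return [w / total for w in weights]


def sample_links(n_blocks, alpha=1.0, decay_scale=2.0, seed=0):
    """Draw one link per block from the prior (no likelihood term)."""
    rng = random.Random(seed)
    links = []
    for i in range(n_blocks):
        probs = link_probabilities(i, alpha, decay_scale)
        links.append(rng.choices(range(i + 1), weights=probs, k=1)[0])
    return links


def links_to_labels(links):
    """Blocks connected by links share a topic label.

    Links only point backwards, so the label of the target block is
    always known by the time block i is processed.
    """
    labels = []
    for i, j in enumerate(links):
        labels.append(i if j == i else labels[j])
    return labels


def boundaries(labels):
    """Indices at which the topic label differs from the previous block."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


if __name__ == "__main__":
    links = sample_links(12)
    labels = links_to_labels(links)
    print("links:     ", links)
    print("labels:    ", labels)
    print("boundaries:", boundaries(labels))
```

Note that the number of distinct labels is not fixed in advance; it follows from how many blocks link to themselves. A full model would combine this prior with a likelihood over the words in each block and sample the links from the resulting posterior.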
