Story Segmentation and Topic Classification of Broadcast News via a Topic-Based Segmental Model and a Genetic Algorithm

This paper presents a two-stage approach to story segmentation and topic classification of broadcast news. In the first stage, a decision tree and a maximum entropy model identify potential story boundaries within a sliding window. Story segmentation is thereby reduced to selecting a sequence of boundary positions from the potential boundary regions. In the second stage, a genetic algorithm searches for the chromosome that corresponds to the final boundary position sequence, using a fitness function defined by a proposed topic-based segmental model. The approach is evaluated with both syllable-based and word-based story segmentation schemes. On the collected broadcast news corpus, story segmentation achieves a miss probability of 0.1587 and a false alarm probability of 0.0859. On the TDT-3 Mandarin audio corpus, it achieves a miss probability of 0.1232 and a false alarm probability of 0.1298. For topic classification, an outside classification accuracy of 74.55% is obtained on the collected broadcast news, and an inside classification accuracy of 88.82% is achieved on the TDT-2 Mandarin audio corpus.
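To make the second stage concrete, the following is a minimal sketch of a genetic algorithm that selects a boundary position sequence from a set of candidate boundaries. It is an illustration under stated assumptions, not the paper's implementation: the chromosome is assumed to be one bit per candidate boundary, and the fitness function is a toy stand-in (label cohesion within each segment minus a per-boundary penalty) rather than the topic-based segmental model, whose form is not given in the abstract. All function names, parameters, and the sample data are hypothetical.

```python
import random
from collections import Counter


def decode(chromosome, candidates):
    """Map a bit string to the boundary positions it keeps."""
    return [pos for bit, pos in zip(chromosome, candidates) if bit]


def fitness(chromosome, candidates, labels, penalty=0.5):
    """Toy fitness (an assumption, NOT the paper's segmental model):
    size of the majority label in each segment, summed over segments,
    minus a fixed penalty for every boundary that is kept."""
    bounds = [0] + decode(chromosome, candidates) + [len(labels)]
    cohesion = 0
    for start, end in zip(bounds, bounds[1:]):
        cohesion += Counter(labels[start:end]).most_common(1)[0][1]
    return cohesion - penalty * sum(chromosome)


def run_ga(candidates, labels, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve bit strings over the candidate boundaries.
    Assumes at least two candidates, sorted and strictly inside the sequence."""
    rng = random.Random(seed)
    n = len(candidates)

    def score(chromosome):
        return fitness(chromosome, candidates, labels)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = [pop[0][:]]  # elitism: carry the best chromosome over unchanged
        while len(nxt) < pop_size:
            # Binary tournament selection for each of the two parents.
            a, b = (max(rng.sample(pop, 2), key=score) for _ in range(2))
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    return decode(max(pop, key=score), candidates)


if __name__ == "__main__":
    # Hypothetical data: one topic label per sentence, plus candidate
    # boundaries as they might come out of a first-stage detector.
    labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
    candidates = [2, 3, 5, 7]
    print(run_ga(candidates, labels))
```

The bit-per-candidate encoding keeps the search space restricted to the first-stage output, which is the sense in which segmentation becomes a choice of boundary position sequence. Swapping in a different fitness function only requires replacing `fitness`.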
