Paper Information - Story Segmentation and Topic Classification of Broadcast News via a Topic-Based Segmental Model and a Genetic Algorithm

This paper presents a two-stage approach to story segmentation and topic classification of broadcast news. In the first stage, a decision tree and a maximum entropy model identify potential story boundaries in the broadcast news within a sliding window. Story segmentation is thereby reduced to determining a sequence of boundary positions from the potential boundary regions. In the second stage, a genetic algorithm searches for the chromosome that corresponds to the final boundary position sequence, using a fitness function defined by a proposed topic-based segmental model. Both syllable-based and word-based story segmentation schemes are used to evaluate the approach. For story segmentation, experiments yield a miss probability of 0.1587 and a false alarm probability of 0.0859 on the collected broadcast news corpus, and a miss probability of 0.1232 and a false alarm probability of 0.1298 on the TDT-3 Mandarin audio corpus. For topic classification, an outside classification accuracy of 74.55% is obtained on the collected broadcast news, and an inside classification accuracy of 88.82% is achieved on the TDT-2 Mandarin audio corpus.
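The second stage can be pictured with a minimal Python sketch: a genetic algorithm that picks one boundary position inside each candidate region. Everything here is an illustrative assumption rather than the authors' method. The names `jaccard`, `fitness`, and `evolve`, the parameter values, and the toy data are hypothetical, and the fitness is a simple word-overlap score standing in for the paper's topic-based segmental model, which the abstract does not specify. The sketch also assumes the regions are ordered and non-overlapping.

```python
import random

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fitness(chromosome, units):
    """Toy fitness (an assumption, not the paper's segmental model):
    higher when each boundary b separates dissimilar units[b-1] and units[b]."""
    return -sum(jaccard(units[b - 1], units[b]) for b in chromosome)

def evolve(units, regions, pop_size=20, generations=40, p_mut=0.2, seed=0):
    """Choose one boundary per candidate region.

    regions: ordered, non-overlapping (lo, hi) inclusive index ranges, lo >= 1.
    Returns the fittest chromosome, a list with one position per region.
    """
    rng = random.Random(seed)

    def random_gene(i):
        lo, hi = regions[i]
        return rng.randint(lo, hi)

    def tournament(pop):
        # Pick the fittest of three randomly sampled chromosomes.
        return max(rng.sample(pop, 3), key=lambda c: fitness(c, units))

    pop = [[random_gene(i) for i in range(len(regions))]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, units), reverse=True)
        next_pop = [pop[0][:]]  # elitism: carry the current best forward
        while len(next_pop) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randint(0, len(regions))  # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(len(child)):
                if rng.random() < p_mut:  # mutation stays inside the region
                    child[i] = random_gene(i)
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda c: fitness(c, units))

# Toy transcript: each unit is the word set of one utterance (made-up data).
units = [
    {"election", "vote"}, {"vote", "ballot"}, {"ballot", "election"},
    {"rain", "storm"}, {"storm", "flood"}, {"flood", "rain"},
    {"goal", "match"}, {"match", "team"},
]
# Candidate boundary regions, as a first-stage detector might propose them.
regions = [(2, 4), (5, 7)]
best = evolve(units, regions)
print(best, fitness(best, units))
```

Encoding each gene as a position within its own region keeps every chromosome valid under crossover and mutation, so no repair step is needed; a different encoding would be required if the number of boundaries were not fixed in advance.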

Chung-Hsien Wu | Chia-Hsin Hsieh
