SeLeCT: a lexical cohesion based news story segmentation system

In this paper we compare the performance of three distinct approaches to lexical cohesion-based text segmentation. Most work in this area has focused on discovering textual units that reflect the subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e., distinct news stories from broadcast news programmes. Our approach to news story segmentation (the SeLeCT system) is based on an analysis of lexical cohesive strength between textual units using a linguistic technique called lexical chaining. We evaluate the performance of SeLeCT relative to two other cohesion-based segmenters: TextTiling and C99. Using a recently introduced evaluation metric, WindowDiff, we contrast the segmentation accuracy of each system on both "spoken" (CNN news transcripts) and "written" (Reuters newswire) news story test sets extracted from the TDT1 corpus.
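As an illustration of the evaluation metric named above, the following is a minimal sketch of WindowDiff as it is commonly defined: a window of k units slides over the text, and a window counts as an error whenever the reference and hypothesised segmentations disagree on the number of boundaries inside it. The function name `window_diff`, the 0/1 gap-flag input representation, and the value of k in the usage lines are assumptions made for this example; this is not the evaluation code used for SeLeCT.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Return the WindowDiff error rate (0.0 is a perfect match).

    Assumed input format: each sequence holds one 0/1 flag per gap between
    adjacent text units, where 1 marks a story boundary at that gap.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of units")
    n_units = len(reference) + 1
    if not 0 < k < n_units:
        raise ValueError("k must be between 1 and the number of units minus 1")

    n_windows = n_units - k
    errors = 0
    for i in range(n_windows):
        # Count the boundaries that fall inside the window starting at unit i.
        ref_count = sum(reference[i:i + k])
        hyp_count = sum(hypothesis[i:i + k])
        if ref_count != hyp_count:
            errors += 1
    return errors / n_windows


# Nine text units with true boundaries after units 3 and 6.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
# Hypothesis that places the first boundary one unit too late (a near miss).
near_miss = [0, 0, 0, 1, 0, 1, 0, 0]
# Hypothesis that finds no boundaries at all.
no_boundaries = [0] * 8

score_exact = window_diff(reference, reference, k=3)
score_near = window_diff(reference, near_miss, k=3)
score_none = window_diff(reference, no_boundaries, k=3)
```

In this sketch a near miss is penalised in only some of the windows, whereas a hypothesis with no boundaries is penalised in every window that contains a reference boundary. A common choice, assumed rather than taken from this abstract, is to set k to half the average reference segment length.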
