Improvement of the dotplotting method for linear text segmentation

The dotplotting method, employed by Reynar (1994), is a state-of-the-art algorithm for automatic linear text segmentation. However, its measure for assessing density, which represents topical coherence, has several problems. First, the density function is asymmetric, which leads to the apparently false conclusion that a forward scan may produce a different segmentation from a backward scan. Second, when determining the next boundary, the assessment strategy does not adequately take the previously located boundaries into account. In this paper we propose modified models that remedy these problems. We also make use of segment length to improve segmentation performance. Experimental results show that the modified models achieve considerable improvement in P_k value, precision, and recall over the original dotplotting method.
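The asymmetry mentioned above can be illustrated with a small sketch. The abstract does not state the density formula, so the code below assumes a commonly described form of the dotplot minimization criterion: for each boundary, the dot product of the term counts of the preceding segment with the term counts of all text after the boundary, divided by the area of that region. The function names (`term_counts`, `dot`, `forward_density`, `backward_density`) and the toy token list are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import Counter


def term_counts(tokens, start, end):
    """Term-frequency vector for tokens[start:end]."""
    return Counter(tokens[start:end])


def dot(u, v):
    """Dot product of two term-frequency vectors (missing terms count as 0)."""
    return sum(count * v[term] for term, count in u.items())


def forward_density(tokens, boundaries):
    """Assumed forward-scan density.

    boundaries is a sorted list that starts at 0 and ends at len(tokens).
    Each interior boundary contributes the similarity between the segment
    just before it and ALL text after it, normalized by the region's area.
    """
    n = len(tokens)
    total = 0.0
    for prev, cur in zip(boundaries, boundaries[1:-1]):
        before = term_counts(tokens, prev, cur)
        rest = term_counts(tokens, cur, n)
        total += dot(before, rest) / ((cur - prev) * (n - cur))
    return total


def backward_density(tokens, boundaries):
    """The same measure applied to the reversed text with mirrored boundaries."""
    n = len(tokens)
    mirrored = [n - b for b in reversed(boundaries)]
    return forward_density(tokens[::-1], mirrored)


# Toy document: six word tokens split into three two-token segments.
tokens = ["a", "b", "a", "c", "c", "a"]
boundaries = [0, 2, 4, 6]

print(forward_density(tokens, boundaries))   # 0.75
print(backward_density(tokens, boundaries))  # 0.625
```

Under this assumed formula, the same set of boundaries receives a different score depending on scan direction, which is the kind of asymmetry the modified models are meant to remove.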

[1]  Min-Yen Kan, et al. Linear Segmentation and Segment Significance, 1998, VLC@COLING/ACL.

[2]  Rebecca J. Passonneau, et al. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues, 1993, ACL.

[3]  Jeffrey C. Reynar. Statistical Models for Topic Segmentation, 1999, ACL.

[4]  John D. Lafferty, et al. Text Segmentation Using Exponential Models, 1997, EMNLP.

[5]  W. Bruce Croft, et al. Query Expansion Using Local and Global Document Analysis, 1996, SIGIR '96.

[6]  W. Bruce Croft, et al. Text Segmentation by Topic, 1997, ECDL.

[7]  Benjamin Ka-Yin T'sou, et al. Segmentation of Chinese Discourse in Content-Based Information Retrieval, 2000, RIAO.

[8]  Mitchell P. Marcus, et al. Topic Segmentation: Algorithms and Applications, 1998.

[9]  Min-Yen Kan, et al. Combining Visual Layout and Lexical Cohesion Features for Text Segmentation, 2001.

[10]  Oskari Heinonen, et al. Optimal Multi-Paragraph Text Segmentation by Dynamic Programming, 1998, ACL.

[11]  Athanasios Kehagias, et al. A Dynamic Programming Algorithm for Linear Text Segmentation, 2004, Journal of Intelligent Information Systems.

[12]  John D. Lafferty, et al. Statistical Models for Text Segmentation, 1999, Machine Learning.

[13]  Andrew Smith, et al. Detecting Subject Boundaries Within Text: A Language Independent Statistical Approach, 1997, EMNLP.

[14]  G. Youmans. A New Tool for Discourse Analysis: The Vocabulary-Management Profile, 1991.

[15]  Hitoshi Isahara, et al. A Statistical Model for Domain-Independent Text Segmentation, 2001, ACL.

[16]  Julia Hirschberg, et al. Empirical Studies on the Disambiguation of Cue Phrases, 1993, Comput. Linguistics.

[17]  Jeffrey C. Reynar. An Automatic Method of Finding Topic Boundaries, 1994, ACL.

[18]  Freddy Y. Y. Choi. Advances in Domain Independent Linear Text Segmentation, 2000, ANLP.

[19]  Kenneth Ward Church. Char_align: A Program for Aligning Parallel Texts at the Character Level, 1993, ACL.

[20]  Marti A. Hearst. Multi-Paragraph Segmentation of Expository Text, 1994, ACL.

[21]  Graeme Hirst, et al. Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text, 1991, Comput. Linguistics.

[22]  Na Ye, et al. Using Multiple Discriminant Analysis Approach for Linear Text Segmentation, 2005, IJCNLP.

[23]  David M. Blei, et al. Topic Segmentation with an Aspect Hidden Markov Model, 2001, SIGIR '01.

[24]  Hideki Kozima, et al. Text Segmentation Based on Similarity between Words, 1993, ACL.

[25]  John Lafferty, et al. A Model of Lexical Attraction and Repulsion, 1997, ACL.