Text Segmentation into Paragraphs Based on Local Text Cohesion

The problem of automatic text segmentation divides into two distinct problems: thematic segmentation into rather large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method of reasonably splitting text into paragraphs based on a text cohesion measure. Specifically, we propose a method for the quantitative evaluation of text cohesion based on a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in a text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing the result with a specially constructed threshold.
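The four stages named above (cohesion evaluation, smoothing, normalization, thresholding) can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: the toy `links` set stands in for the collocation database, cohesion is taken as the fraction of linked or repeated word pairs across a sentence gap, smoothing uses a triangular window, normalization is min-max scaling, and the threshold is a fixed constant rather than the specially constructed one the paper refers to.

```python
from typing import FrozenSet, List, Set

Link = FrozenSet[str]  # unordered word pair; hypothetical stand-in for a collocation entry


def cohesion(sentences: List[List[str]], links: Set[Link]) -> List[float]:
    """Raw cohesion at each gap between adjacent sentences (assumed measure)."""
    values = []
    for left, right in zip(sentences, sentences[1:]):
        # Count word pairs across the gap that repeat or appear in the link set.
        linked = sum(
            1
            for a in left
            for b in right
            if a == b or frozenset((a, b)) in links
        )
        values.append(linked / max(1, len(left) * len(right)))
    return values


def smooth(values: List[float], radius: int = 1) -> List[float]:
    """Triangular-window smoothing, renormalized at the edges (assumed kernel)."""
    out = []
    for i in range(len(values)):
        total = weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = radius + 1 - abs(i - j)
            total += w * values[j]
            weight_sum += w
        out.append(total / weight_sum)
    return out


def normalize(values: List[float]) -> List[float]:
    """Min-max scaling to [0, 1]; uniform input is treated as fully cohesive."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def paragraph_breaks(
    sentences: List[List[str]],
    links: Set[Link],
    threshold: float = 0.25,  # assumed constant, not the paper's threshold
    radius: int = 1,
) -> List[int]:
    """Indices of sentences that start a new paragraph."""
    if len(sentences) < 2:
        return []
    scores = normalize(smooth(cohesion(sentences, links), radius))
    return [i + 1 for i, s in enumerate(scores) if s < threshold]


# Toy input: two sentences about a cat, then two about the stock market.
demo_sentences = [
    ["cat", "drinks", "milk"],
    ["cat", "likes", "milk"],
    ["stock", "market", "fell"],
    ["market", "prices", "fell"],
]
demo_links = {frozenset(("drinks", "milk")), frozenset(("stock", "prices"))}
print(paragraph_breaks(demo_sentences, demo_links))  # prints [2]
```

In this toy input the only gap with no repeated or linked words lies between the second and third sentences, so the sketch places a single paragraph break before sentence index 2.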
