Using linguistically motivated features for paragraph boundary identification

In this paper we propose a machine-learning approach to paragraph boundary identification that uses linguistically motivated features. We investigate the relation between paragraph boundaries and discourse cues, pronominalization, and information structure. We test our algorithm on German data and report improvements over three baselines, including a reimplementation of Sporleder & Lapata's (2006) work on paragraph segmentation. An analysis of the features' contribution suggests an interpretation of what paragraph boundaries indicate and what they depend on.

[1]  Marti A. Hearst,et al.  A Critique and Improvement of an Evaluation Metric for Text Segmentation , 2002, CL.

[2]  Eric Fosler-Lussier,et al.  Discourse Segmentation of Multi-Party Conversation , 2003, ACL.

[3]  Marti A. Hearst.  TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages , 1997, CL.

[4]  Thorsten Brants,et al.  TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[5]  Walter Daelemans,et al.  TiMBL: Tilburg Memory-Based Learner , 2007 .

[6]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[7]  Mirella Lapata,et al.  Automatic Paragraph Identification: A Study across Languages and Domains , 2004, EMNLP.

[8]  Dmitriy Genzel.  A Paragraph Boundary Detection System , 2005, CICLing.

[9]  John A. Bateman,et al.  Rhetorical structure theory , 2006 .

[10]  Heather A. Stark.  What do paragraph markings do? , 1988 .

[11]  Donia Scott,et al.  Document Structure , 2003, CL.

[12]  William C. Mann,et al.  Rhetorical Structure Theory: A Framework for the Analysis of Texts , 1987 .

[13]  Alexander F. Gelbukh,et al.  Text Segmentation into Paragraphs Based on Local Text Cohesion , 2001, TSD.

[14]  Walter Daelemans,et al.  TiMBL: Tilburg Memory-Based Learner, version 2.0, Reference guide , 1998 .

[15]  Eugene Charniak,et al.  Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number , 2003, EMNLP.

[16]  Michael Strube,et al.  Beyond the Pipeline: Discrete Optimization in NLP , 2005, CoNLL.

[17]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[18]  William C. Mann,et al.  Rhetorical Structure Theory: Toward a functional theory of text organization , 1988 .

[19]  Mirella Lapata,et al.  Broad coverage paragraph segmentation across languages and domains , 2006, TSLP.

[20]  Garland Cannon,et al.  The Holt guide to English , 1972 .

[21]  Freddy Y. Y. Choi.  Advances in domain independent linear text segmentation , 2000, ANLP.

[22]  Dan Roth,et al.  A Linear Programming Formulation for Global Inference in Natural Language Tasks , 2004, CoNLL.

[23]  Marti A. Hearst,et al.  A critique and improvement of an evaluation metric for text segmentation , 2002 .

[24]  Wolfgang Menzel,et al.  Robust Parsing: More with Less , 2006, Workshop On ROMAND Robust Methods In Analysis Of Natural Language Data.