Automatic Paragraph Identification: A Study across Languages and Domains

In this paper, we investigate whether paragraphs can be identified automatically across different languages and domains. We propose a machine learning approach that exploits textual and discourse cues, and we assess how well humans perform on this task. Our best models achieve an accuracy that is significantly higher than the best baseline and, for most data sets, comes to within 6% of human performance.
