Combining Hierarchical Clustering and Machine Learning to Predict High-Level Discourse Structure

We propose a novel method to predict the inter-paragraph discourse structure of text, i.e., to infer which paragraphs are related to each other and form larger segments at a higher level. Our method combines a clustering algorithm with a model of segment "relatedness" acquired in a machine learning step. The model integrates information from a variety of sources, such as word co-occurrence, lexical chains, cue phrases, punctuation, and tense. Our method outperforms an approach that relies on word co-occurrence alone.
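
The combination of clustering and a relatedness model can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that only adjacent segments may be merged, and it substitutes a plain word-overlap score (the hypothetical `word_overlap`) for the learned relatedness model, which the abstract describes as drawing on many more sources. The names `cluster_paragraphs` and `Tree` are likewise illustrative.

```python
from typing import Callable, List, Union

# A tree is either a paragraph index (leaf) or a (left, right) pair of subtrees.
Tree = Union[int, tuple]


def word_overlap(a: List[str], b: List[str]) -> float:
    """Stand-in relatedness score: Jaccard overlap of the two word sets.

    Assumption: a learned model would replace this function.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cluster_paragraphs(
    paragraphs: List[str],
    relatedness: Callable[[List[str], List[str]], float] = word_overlap,
) -> Tree:
    """Greedily merge the most related pair of adjacent segments until one remains."""
    if not paragraphs:
        raise ValueError("need at least one paragraph")
    # Each segment is (tree, words); initially one segment per paragraph.
    segments = [(i, p.lower().split()) for i, p in enumerate(paragraphs)]
    while len(segments) > 1:
        # Score every pair of neighbouring segments.
        scores = [
            relatedness(segments[i][1], segments[i + 1][1])
            for i in range(len(segments) - 1)
        ]
        # Pick the best-scoring pair (the leftmost one on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        (left_tree, left_words) = segments[best]
        (right_tree, right_words) = segments[best + 1]
        # Replace the two segments by their merged parent segment.
        segments[best:best + 2] = [
            ((left_tree, right_tree), left_words + right_words)
        ]
    return segments[0][0]


if __name__ == "__main__":
    paras = [
        "cats purr softly",
        "cats purr loudly",
        "stocks fell sharply",
        "stocks fell again",
    ]
    print(cluster_paragraphs(paras))  # ((0, 1), (2, 3))
```

Passing a different `relatedness` callable is the point at which a trained scorer could be plugged in, which is why the clustering loop is kept independent of how the score is computed.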
