Learning Recursive Segments for Discourse Parsing

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation has relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, since some theories of discourse, such as SDRT, allow for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques combined with a simple repair heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging results, with an F-score of 73% for finding EDUs.
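To make the idea of a repair step concrete, the following is a minimal sketch of how per-token boundary predictions might be turned into well-nested segments. It is an illustration under stated assumptions, not the heuristic used in the paper: the label set ("open", "close", "none"), the function name repair_and_extract, and the two repair rules (drop a stray close, close any still-open segment at the last token) are all hypothetical.

```python
from typing import List, Tuple


def repair_and_extract(labels: List[str]) -> List[Tuple[int, int]]:
    """Turn per-token boundary labels into well-nested (start, end) spans.

    Assumed (hypothetical) label set, one label per token:
      "open"  - the token starts a segment
      "close" - the token ends the most recently opened segment
      "none"  - the token is not a boundary
    """
    stack: List[int] = []              # start indices of segments still open
    spans: List[Tuple[int, int]] = []

    for i, label in enumerate(labels):
        if label == "open":
            stack.append(i)
        elif label == "close":
            if stack:
                # Matching the innermost open segment keeps spans nested.
                spans.append((stack.pop(), i))
            # Assumed repair rule 1: a close with no open segment is dropped.

    # Assumed repair rule 2: segments left open end at the last token.
    last = len(labels) - 1
    while stack:
        spans.append((stack.pop(), last))

    return sorted(spans)


# Hypothetical classifier output for a 10-token text.
predicted = ["open", "none", "open", "none", "close",
             "none", "close", "close", "open", "none"]
print(repair_and_extract(predicted))
```

In this example the segment spanning tokens 2 to 4 is nested inside the one spanning tokens 0 to 6, the unmatched "close" at token 7 is discarded, and the segment opened at token 8 is closed at the end of the text. With one label per token, this sketch cannot express two segments ending on the same token.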
