Features for automatic discourse analysis of paragraphs

In this paper, we investigate which information is useful for detecting rhetorical (RST) relations between (Multi-)Sentential Discourse Units ((M-)SDUs), that is, text spans consisting of one or more sentences, within the same paragraph. To do so, we simplified the task of discourse parsing to a decision problem: deciding whether an (M-)SDU is rhetorically related to either a preceding or a following (M-)SDU. Employing the RST Treebank (Carlson et al. 2003), we offered this choice to machine learning algorithms together with syntactic, lexical, referential, discourse and surface features. Next, the features were ranked on the basis of (1) the models established by the classification algorithms and (2) feature selection metrics. Highly ranked features that predict the presence of a rhetorical relation are syntactic similarity, word overlap, word similarity, continuous punctuation and many reference features. Other highly ranked features signal the introduction of a new topic or argument: time references, proper nouns, definite articles and the word "further".
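As an illustration of the decision problem described above, the following minimal sketch attaches an (M-)SDU to whichever neighbouring span it shares more words with. It is not the method of the paper: the Jaccard definition of word overlap, the regular-expression tokenizer, the tie-breaking rule and the function names `tokens`, `word_overlap` and `attach_direction` are all assumptions made for the example, and a single hand-written rule stands in for the trained classifiers and the full feature set.

```python
import re


def tokens(text):
    """Return the set of lowercased word tokens in a span (crude stand-in tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def word_overlap(span_a, span_b):
    """Jaccard overlap of the word sets of two spans (assumed definition)."""
    a, b = tokens(span_a), tokens(span_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def attach_direction(preceding, current, following):
    """Pick the neighbour that overlaps more with `current`.

    Ties go to 'preceding'; this choice is arbitrary and only
    keeps the sketch deterministic.
    """
    left = word_overlap(preceding, current)
    right = word_overlap(current, following)
    return "preceding" if left >= right else "following"


if __name__ == "__main__":
    print(attach_direction(
        "The parser reads the sentence.",
        "The parser then labels the sentence.",
        "Prices rose sharply in March.",
    ))
```

In a learning setup such as the one the abstract describes, a value like `word_overlap` would be one feature among many rather than the decision rule itself.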
