Robust discourse parsing via discourse markers, topicality and position

This paper describes a simple discourse parsing and analysis algorithm that combines a formal discourse grammar utilising underspecification with Information Retrieval (IR) techniques. First, linguistic knowledge based on discourse markers is used to constrain a totally underspecified discourse representation. Second, the remaining underspecification is reduced by computing a topicality score for every discourse unit via the vector space model. Finally, sentences in a prominent position (e.g. the first sentence of a paragraph) are given an adjusted topicality score. The algorithm was evaluated on a text summarisation task. Measured against a ‘gold standard’ of the most salient sentences in a given text, obtained from a psycholinguistic experiment, the algorithm performs better than commonly used machine learning and statistical approaches to summarisation.
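The topicality and position steps can be illustrated with a minimal sketch. It is not the paper's implementation: the smoothed tf-idf weighting, the comparison of each unit against a whole-text centroid, the multiplicative `boost` factor, and all function and parameter names are assumptions made for illustration, since the abstract states only that the vector space model is used and that prominent sentences receive an adjusted score.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude lowercase word tokenizer (an assumption; no stemming or stop list).
    return re.findall(r"[a-z]+", text.lower())


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topicality_scores(units, prominent=(), boost=1.5):
    """Score each discourse unit against the whole text.

    units     -- list of strings, one per discourse unit
    prominent -- indices of units in a prominent position
                 (hypothetical parameter, e.g. paragraph-initial sentences)
    boost     -- hypothetical multiplicative adjustment for prominent units
    """
    counts = [Counter(tokenize(u)) for u in units]
    n = len(units)
    # Document frequency: number of units in which each term occurs.
    df = Counter(t for c in counts for t in c)
    # Smoothed idf, so terms occurring in every unit keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: f * idf[t] for t, f in c.items()} for c in counts]
    # Centroid of all unit vectors stands in for the topic of the text.
    centroid = Counter()
    for vec in vectors:
        for t, w in vec.items():
            centroid[t] += w
    scores = [cosine(vec, centroid) for vec in vectors]
    # Position adjustment: scale the score of prominent units.
    return [s * boost if i in prominent else s for i, s in enumerate(scores)]
```

Calling `topicality_scores(units, prominent={0})` would treat the first unit as paragraph-initial; the highest-scoring units could then be selected as an extractive summary.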
