Discourse Segmentation of German Texts

This paper addresses the problem of segmenting German texts into minimal discourse units, as needed, for example, in RST-based discourse parsing. We discuss relevant variants of the problem, introduce the design of our annotation guidelines, and report the results of an extensive inter-annotator agreement study on the annotated corpus. We then describe experiments with three automatic classifiers that rely on the output of state-of-the-art parsers and use different amounts and kinds of syntactic knowledge: constituent parsing versus dependency parsing, and tree-structure classification versus sequence labeling. Finally, we compare our approaches with recent discourse segmentation methods proposed for English.
