Cross-lingual and cross-domain discourse segmentation of entire documents

Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters exist for only a few languages and domains. Typically, they detect only intra-sentential segment boundaries, assume gold-standard sentence and token segmentation, and rely on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold pre-annotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.
