ToNy: Contextual embeddings for accurate multilingual discourse segmentation of full documents

Segmentation is the first step in building practical discourse parsers, yet it is often neglected in discourse parsing studies. The goal is to identify the minimal spans of text to be linked by discourse relations, or to isolate the explicit marking of discourse relations. Existing systems for English report F1 scores as high as 95%, but they generally assume gold sentence boundaries and are restricted to English newswire texts annotated within the RST framework. This article presents a generic approach and a system, ToNy, a discourse segmenter developed for the DisRPT shared task, in which multiple discourse representation schemes, languages and domains are represented. In our experiments, we found that a straightforward sequence prediction architecture with pretrained contextual embeddings is sufficient to reach performance levels comparable to existing systems when trained separately on each corpus. We report F1 scores between 81% and 96%. We also observed that discourse segmentation models display only a moderate generalization capability, even within the same language and discourse representation scheme.
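To make the task framing concrete, the sketch below treats segmentation as token-level sequence labeling and scores predicted boundaries with F1. It is an illustration only, not the ToNy system: the binary label scheme (1 for a token that opens a segment, 0 otherwise), the helper names `spans_to_labels` and `boundary_f1`, and the toy data are all assumptions, and the shared task's official scoring may differ in detail.

```python
from typing import List, Tuple


def spans_to_labels(segment_lengths: List[int]) -> List[int]:
    """Turn segment lengths (in tokens) into per-token labels.

    Assumed scheme: 1 marks the first token of a segment, 0 any other token.
    """
    labels: List[int] = []
    for n in segment_lengths:
        if n <= 0:
            raise ValueError("segment lengths must be positive")
        labels.extend([1] + [0] * (n - 1))
    return labels


def boundary_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over tokens labeled as segment starts."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    n_pred = sum(pred)
    n_gold = sum(gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy document of 12 tokens split into three segments of 4, 3 and 5 tokens.
gold = spans_to_labels([4, 3, 5])
# Hypothetical tagger output: the third boundary is placed one token too late.
pred = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = boundary_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this framing a tagger only has to emit one label per token, which is why a generic sequence prediction model can be applied to whole documents without gold sentence boundaries.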
