From News to Medical: Cross-domain Discourse Segmentation

The first step in discourse analysis is dividing a text into segments. We annotate the first high-quality, small-scale medical corpus in English with discourse segments and analyze how well news-trained segmenters perform on this domain. As expected, performance drops, but the nature of the segmentation errors suggests that some problems can be addressed earlier in the pipeline, while others would require expanding the corpus to a trainable size to learn the nuances of the medical domain.
