Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank

We introduce TED-Multilingual Discourse Bank (TED-MDB), a corpus of TED talk transcripts in six languages (English, German, Polish, European Portuguese, Russian and Turkish), whose ultimate aim is to provide a clearly described level of discourse structure and semantics in multiple languages. The corpus is manually annotated following the goals and principles of the PDTB, covering explicit and implicit discourse connectives, entity relations, alternative lexicalizations and no relations. We also aim to capture the characteristics of spoken language present in the transcripts and adapt the PDTB scheme accordingly; for example, we introduce hypophora. We also identify other aspects of spoken discourse, such as the discourse marker use of connectives, and keep them distinct from their discourse connective use. TED-MDB is, to the best of our knowledge, one of the few multilingual discourse treebanks, and we hope it will serve as a source of parallel data for contrastive linguistic analysis as well as language technology applications. We describe the corpus and the annotation procedure, and provide preliminary corpus statistics.
