Multilingual Annotation and Disambiguation of Discourse Connectives for Machine Translation

Many discourse connectives can signal several types of relations between sentences. Their automatic disambiguation, i.e. labeling each occurrence with its correct sense, is important for discourse parsing and could also help machine translation. We describe new approaches that use parallel corpora to improve the accuracy of manual annotation for three discourse connectives (two English, one French). An appropriate set of labels for each connective can be found using information from its translations. Our automatic disambiguation results are state-of-the-art, reaching up to 85% accuracy with surface features. Feature analysis shows that contextual features are useful across languages and connectives.
