Translation as Annotation

In this paper we illustrate an approach to creating high-quality linguistically annotated resources by exploiting aligned parallel corpora. The approach rests on the key notion that translating a text can be seen as a linguistic annotation task, and one that is easier than manual annotation with formal schemes. Once a text has been translated, formal annotations can be derived automatically from the aligned translations. We show that translations can be exploited in several ways to speed up and automate the linguistic annotation of texts. If none of the texts is already annotated, information from the aligned texts can be used to carry out the annotation from scratch. Conversely, if the texts in one language have been annotated and the others have not, annotations can be transferred from one language to the other. The transfer-based method makes it possible to exploit existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new languages with greatly reduced human effort.
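
The transfer-based idea can be illustrated with a minimal sketch, which is not the implementation described in the paper. It assumes that a word alignment is already available as a list of (source index, target index) pairs and that the source tokens carry one label each. The function name `project_annotations`, the tag set, and the example sentence pair are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def project_annotations(
    source_tags: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy each source-side label to the target token it is aligned with.

    Target tokens with no aligned source token stay unlabeled (None).
    If several source tokens align to one target token, the first label
    wins; this tie-breaking rule is an assumption of the sketch.
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        # Skip alignment links that point outside either sentence.
        if not (0 <= src_idx < len(source_tags)):
            continue
        if not (0 <= tgt_idx < target_length):
            continue
        tag = source_tags[src_idx]
        if tag is not None and projected[tgt_idx] is None:
            projected[tgt_idx] = tag
    return projected


# Hypothetical example: an annotated English sentence and its Italian translation.
english = ["the", "black", "cat", "sleeps"]
english_tags = ["DET", "ADJ", "NOUN", "VERB"]
italian = ["il", "gatto", "nero", "dorme"]
# Word alignment as (English index, Italian index); adjective and noun swap order.
links = [(0, 0), (1, 2), (2, 1), (3, 3)]

italian_tags = project_annotations(english_tags, links, len(italian))
for token, tag in zip(italian, italian_tags):
    print(token, tag)
```

In this sketch the human effort is limited to producing (or checking) the translation and the alignment; the labels on the target side follow mechanically. Real parallel text also contains unaligned words and many-to-one links, which the sketch handles only in the simplest way, by leaving gaps or keeping the first label.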
