QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

This work presents parallel corpora automatically annotated at several levels using NLP tools: lemmatization and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference resolution. The corpora comprise the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is the common language across all the parallel corpora, paired with translations in five languages: Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, and report annotation statistics for each language. These new resources are freely available and are intended to support research on semantic processing for machine translation and cross-lingual transfer.
