A methodology for the joint development of the Basque WordNet and Semcor

This paper describes the methodology adopted to jointly develop the Basque WordNet and a hand-annotated corpus (the Basque Semcor). This joint development allows for better-motivated sense distinctions and a tighter coupling between the two resources. The methodology involves editing, tagging and refereeing tasks. We are currently halfway through the nominal part of the 300,000-word corpus (roughly equivalent to a 500,000-word corpus for English). We present a detailed description of the task, including the main criteria for difficult cases in the editing of the senses and the tagging of the corpus, with special attention to multiword entries. Finally, we give a detailed picture of the current figures, as well as an analysis of the agreement rates.
