A Preliminary Study for Building the Basque PropBank

This paper presents a methodology for adding a layer of semantic annotation, in terms of semantic roles, to a syntactically annotated corpus of Basque (EPEC). Our proposal combines three resources: the model used in the PropBank project (Palmer et al., 2005), an in-house database of syntactic/semantic subcategorization frames for Basque verbs (Aldezabal, 2004) and the Basque dependency treebank (Aduriz et al., 2003). To validate the methodology and to confirm whether the PropBank model suits both Basque and our treebank design, we built lexical entries for three Basque verbs and labelled all arguments and adjuncts of those verbs occurring in our treebank. The results were positive: the study produced a methodology adapted to the characteristics of the language and of the Basque dependency treebank. A further goal was to determine whether semi-automatic tagging is feasible, the idea being to present human taggers with a pre-tagged version of the corpus. We found that many arguments could be tagged automatically with high precision, given only the verbal entries for the verbs and a handful of examples.

[1]  C. Baird, et al.  The pilot study, 2000, Orthopedic Nursing.

[2]  Roser Morante, et al.  4LEX: a Multilingual Lexical Resource, 2005.

[3]  Eneko Agirre, et al.  A Pilot Study of English Selectional Preferences and Their Cross-Lingual Compatibility with Basque, 2003, TSD.

[4]  John B. Lowe, et al.  The Berkeley FrameNet Project, 1998, ACL.

[5]  Kepa Sarasola, et al.  Application of finite-state transducers to the acquisition of verb subcategorization information, 2003, Nat. Lang. Eng.

[6]  Xabier Arregi, et al.  XUXEN: A Spelling Checker/Corrector for Basque Based on Two-Level Morphology, 1992, ANLP.

[7]  M. González Rodríguez, et al.  Proceedings of the Third International Conference on Language Resources and Evaluation, 2002.

[8]  Díaz de Ilarraza.  Construction of a Basque Dependency Treebank, 2003.

[9]  김두식, et al.  English Verb Classes and Alternations, 2006.

[10]  Díaz de Ilarraza, et al.  A framework for the automatic processing of Basque, 2007.

[11]  Daniel Gildea, et al.  The Proposition Bank: An Annotated Corpus of Semantic Roles, 2005, CL.

[12]  Nianwen Xue, et al.  Annotating the Propositions in the Penn Chinese Treebank, 2003, SIGHAN.

[13]  Martha Palmer, et al.  From TreeBank to PropBank, 2002, LREC.

[14]  Arantza Díaz de Ilarraza Sánchez, et al.  EUSLEM: Un lematizador/etiquetador de textos en euskara, 1994.

[15]  Juan Aparicio, et al.  3LB-LEX: léxico verbal con frames sintáctico-semánticos, 2005, Proces. del Leng. Natural.

[16]  Mikel Lersundi Ayestaran.  Ezagutza base lexikala eraikitzeko euskal hiztegiko definizioen azterketa sintaktiko semantikoa. Hitzen arteko erlazio lexiko semantikoak, definizio patroiak, eratorpena eta postposizioak, 2005.

[17]  Eneko Agirre, et al.  Lexicalization and Multiword Expressions in the Basque WordNet, 2006.

[18]  Petr Pajas, et al.  PDT-VALLEX: Creating a Large-coverage Valency Lexicon for Treebank Annotation, 2003.

[19]  Beth Levin, et al.  English Verb Classes and Alternations: A Preliminary Investigation, 1993.

[20]  Izaskun Aldezabal Roteta.  Aditz-azpikategorizazioaren azterketa sintaxi partzialetik sintaxi osorako bidean. 100 aditzen azterketa, Levin-en (1993) lana oinarri hartuta eta metodo automatikoak baliatuz, 2004.

[21]  Eneko Agirre, et al.  A methodology for the joint development of the Basque WordNet and Semcor, 2006, LREC.