Representation and Treatment of Multiword Expressions in Basque

This paper describes the representation of Basque multiword lexical units and the automatic processing of multiword expressions. After discussing which kinds of multiword expressions we process at the current stage of the work, we present the representation schema for the corresponding lexical units in a general-purpose lexical database. Owing to its expressive power, the schema can handle not only fixed expressions but also morphosyntactically flexible constructions. It also allows us to lemmatize a word combination as a single unit while still parsing its components individually when necessary. We then describe HABIL, a tool for the automatic processing of these expressions, and report some evaluation results. This work forms part of a general framework of tools for processing written Basque, which currently ranges from the tokenization and segmentation of single words to the syntactic tagging of general texts.
