COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions

This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese multiword (MW) expressions. The objective of this ongoing project is to compile a large lexical database of MW units of the Portuguese language, automatically extracted from a balanced 50-million-word corpus and manually validated with the help of lexical association measures. The MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations, which are composed of words that tend to occur together and that form syntactic dependencies, although with a low degree of fixedness. This new resource has a twofold objective: (i) to be an important research tool that supports the development of typologies of MW expressions and their lexicographic treatment; and (ii) to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
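
The abstract does not say which lexical association measures the project relies on. As an illustration only, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one widely used measure of this kind. The function name `pmi_scores`, the `min_count` threshold, and the toy sentence are assumptions made for this example and are not taken from COMBINA-PT.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration; not the COMBINA-PT pipeline.
    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy input (invented for this example, not project data).
tokens = ("o primeiro ministro falou e o primeiro ministro saiu "
          "e o governo falou").split()
scores = pmi_scores(tokens)

# Candidates sorted from strongest to weakest association.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for pair, score in ranked:
    print(" ".join(pair), round(score, 3))
```

In a workflow like the one described, a ranked list of this kind would only propose candidates; deciding whether a pair is a frozen group, a collocation, or noise remains a manual validation step.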
