Multilingual Aspects of Multiword Lexical Units

Because most machine-readable dictionaries contain insufficient information about multiword lexical units, specialized lexical databases must be constantly extended and tuned to account for new expressions. In this paper, we present a purely statistical system that extracts contiguous and noncontiguous rigid multiword lexical units on a large scale from unrestricted text corpora. To this end, a new association measure, Mutual Expectation, is combined with a new acquisition process based on a local-maxima algorithm. The system has been applied to a parallel corpus in Portuguese, French, English and Italian, and the results show that multiword lexical units exhibit many cross-language regularities.
