The creation and development of a large Lexical Database (LDB), which until now mainly reuses the data found in standard Machine Readable Dictionaries, has been going on in Pisa for a number of years (see Calzolari 1984, 1988; Calzolari, Picchi 1988). We are well aware that, in order to build a more powerful LDB (or even a Lexical Knowledge Base) to be used in different Computational Linguistics (CL) applications, types of information other than those usually found in machine readable dictionaries are urgently needed. Different sources of information must therefore be exploited if we want to overcome the 'lexical bottleneck' of Natural Language Processing (NLP). In a trend which is becoming increasingly relevant both in CL proper and in Literary and Linguistic Computing, we feel that very interesting data for our LDBs can be found by processing large textual corpora, where the actual usage of the language can be truly investigated. Many research projects are nowadays collecting large amounts of textual data, thus providing more and more material to be analyzed for descriptions based on measurable evidence of how language is actually used.

We ultimately aim at integrating lexical data extracted from the analysis of large textual corpora into the LDB we are implementing. These data refer, typically, to:

i) complementation relations introduced by prepositions (e.g. dividere subcategorizes for a PP headed by the preposition in in one sense, and by the preposition fra in another sense);
ii) lexically conditioned modification relations (una macchina potente, un farmaco potente and not forte, while un caffè forte, una moneta forte and not potente);
iii) lexically significant collocations (prendere una decisione and not fare una decisione, prestare attenzione and not dare);
iv) fixed phrases and idioms (donna in carriera, dottorato di ricerca, a proposito di);
v) compounds (tavola calda, nave scuola).

All these types of data are a major issue of practical relevance, and particularly problematic, in many NLP applications in different areas. They should therefore be given very large coverage in any useful LDB and, moreover, should also be annotated, in a computerized lexicon, with the pertinent frequency information obtained from the processed corpus, and obviously updated from time to time. As a matter of fact, dictionaries now tend to encode all the theoretical possibilities on the same level, but "if every possibility in the dictionary must be given equal weight, parsing is very difficult" (Church 1988, p. 3): they should provide information on what is more likely to occur, e.g. the relative likelihood of alternate parts of speech for a word or of alternate word-senses, both out of context and, if possible, taking contextual factors into account.

Statistical analyses of linguistic data were very popular in the '50s and '60s, mainly, even though not only, for literary types of analyses and for studies on the lexicon (Guiraud 1959, Muller 1964, Moskovich 1977). Stochastic approaches to linguistic analysis have been strongly reevaluated in the past few years, whether for syntactic analysis (Garside et al. 1987, Church 1988), for NLP applications (Brown et al. 1988), or for semantic analysis (Zernik 1989, Smadja 1989). Quantitative (not statistical) evidence on, e.g., word-sense occurrences in a large corpus has been taken into account for lexicographic descriptions (Cobuild 1987).
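To make the notion of frequency-annotated collocations concrete, the sketch below computes the association ratio (pointwise mutual information) in the spirit of Church & Hanks (1989), one plausible way, though not necessarily the method adopted here, to rank an attested collocation such as prendere una decisione above an unattested one such as fare una decisione. The function name, window size, and toy corpus are illustrative assumptions, not taken from this paper.

import math
from collections import Counter

def association_ratio(tokens, window=5):
    """Pointwise mutual information I(x,y) = log2(P(x,y) / (P(x) * P(y)))
    for ordered word pairs co-occurring within `window` tokens, in the
    spirit of Church & Hanks (1989); all names here are illustrative."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, x in enumerate(tokens):
        # Count each word with the words that follow it inside the window.
        for y in tokens[i + 1 : i + window]:
            pairs[(x, y)] += 1
    scores = {}
    for (x, y), fxy in pairs.items():
        # Probabilities estimated by simple relative frequency.
        scores[(x, y)] = math.log2((fxy / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
    return scores

# Toy corpus: the attested collocation scores higher than the unattested one.
corpus = ("prendere una decisione " * 8 + "fare una domanda " * 8
          + "fare una decisione ").split()
scores = association_ratio(corpus)
print(scores[("prendere", "decisione")])  # clearly positive association
print(scores[("fare", "decisione")])      # near or below zero

On this toy corpus, prendere/decisione receives a clearly positive score while fare/decisione falls below zero, which is exactly the kind of relative-likelihood annotation argued for above.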
[1] Frank A. Smadja, et al. Microcoding the Lexicon with Co-occurrence Knowledge, 1989.
[2] Nicoletta Calzolari, et al. Acquisition of Semantic Information From an On-Line Dictionary, 1988, COLING.
[3] John Sinclair, et al. Collins COBUILD English Language Dictionary, 1987.
[4] John Cocke, et al. A Statistical Approach to Language Translation, 1988, COLING.
[5] P. Guiraud. Problèmes et méthodes de la statistique linguistique, 1960.
[6] Mitchell P. Marcus, et al. Automatic Acquisition of the Lexical Semantics of Verbs from Sentence Frames, 1989, ACL.
[7] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.
[8] Charles Muller. Fréquence, dispersion et usage - À propos des dictionnaires de fréquence, 1965.
[9] Alain Polguère, et al. A Formal Lexicon in the Meaning-Text Theory (or How to Do Lexica with Words), 1987, Computational Linguistics.
[10] Nicoletta Calzolari, et al. The dictionary and the thesaurus can be combined, 1989.
[11] Nicoletta Calzolari, et al. Detecting Patterns in a Lexical Data Base, 1984, ACL.
[12] Carlo Tagliavini, et al. Lessico di frequenza della lingua italiana contemporanea, 1972.
[13] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.
[14] Donald Hindle, et al. Acquiring Disambiguation Rules from Text, 1989, ACL.