This paper describes the automatic extraction of the terminology of a specific domain from a large corpus. The use of statistical methods yields a number of solutions, but these produce a considerable amount of noise. The task we have concentrated on is the creation and testing of an original method to reduce high noise rates by combining linguistic data and statistical methods. Starting from a rigorous linguistic study of terms in the domain of telecommunications, we designed a number of filters which enable one to obtain a first selection of sequences that may be considered as terms on the grounds of morphosyntactic criteria. Various statistical methods are applied to this selection and the results are evaluated. The best statistical model (that is to say, the one that gives a correct list of terms with the lowest rates of noise and silence) turns out to be the one based on the likelihood ratio in which frequency is taken into account. This result contradicts numerous previous results regarding the extraction of lexical resources, which claim that the association ratio (for example, the use of mutual information) is more significant than frequency.
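The contrast between the two association scores can be made concrete on a 2x2 contingency table of bigram counts. The following Python sketch is illustrative only, not the authors' implementation: the function names and all counts are hypothetical. It computes a Dunning-style log-likelihood ratio statistic and pointwise mutual information for a candidate word pair, showing why the former rewards frequently observed pairs while the latter can be inflated for rare ones.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning-style G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: occurrences of the candidate pair (w1, w2)
    k12: occurrences of w1 followed by a word other than w2
    k21: occurrences of w2 preceded by a word other than w1
    k22: all remaining bigrams in the corpus
    """
    n = k11 + k12 + k21 + k22
    cells = (
        (k11, (k11 + k12) * (k11 + k21) / n),
        (k12, (k11 + k12) * (k12 + k22) / n),
        (k21, (k21 + k22) * (k11 + k21) / n),
        (k22, (k21 + k22) * (k12 + k22) / n),
    )
    # 2 * sum of observed * ln(observed / expected); a zero cell contributes 0.
    return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)


def pmi(k11, k12, k21, k22):
    """Pointwise mutual information of the pair (w1, w2), in bits."""
    n = k11 + k12 + k21 + k22
    return math.log2(k11 * n / ((k11 + k12) * (k11 + k21)))


# Illustrative counts only: a frequent candidate pair versus a rare one,
# both drawn from a hypothetical corpus of 50,000 bigrams.
frequent = (80, 920, 420, 48580)   # pair observed 80 times
rare = (2, 3, 4, 49991)            # pair observed twice, both words rare

for label, k in (("frequent", frequent), ("rare", rare)):
    print(label, round(log_likelihood_ratio(*k), 1), round(pmi(*k), 2))
```

With these made-up counts, mutual information ranks the rare pair well above the frequent one, whereas the likelihood ratio ranks the frequent pair higher, which is the behaviour the comparison in the abstract turns on.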