Combined approach for terminology extraction: lexical statistics and linguistic filtering

This paper describes the automatic extraction of the terminology of a specific domain from a large corpus. The use of statistical methods yields a number of solutions, but these produce a considerable amount of noise. The task we have concentrated on is the creation and testing of an original method to reduce high noise rates by combining linguistic data and statistical methods. Starting from a rigorous linguistic study of terms in the domain of telecommunications, we designed a number of filters that yield a first selection of sequences that may be considered terms on the grounds of morphosyntactic criteria. Various statistical methods are applied to this selection and the results are evaluated. The best statistical model, that is to say, the one that gives a correct list of terms with the lowest rates of noise and silence, turns out to be the one based on the likelihood ratio, in which frequency is taken into account. This result contradicts numerous previous results on the extraction of lexical resources, which claim that the association ratio (for example, mutual information) is more significant than frequency.
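The contrast between the two families of association measures can be illustrated with a minimal sketch (not the paper's implementation; the counts below are invented for illustration). Pointwise mutual information normalizes away frequency, so a pair seen once can outscore a well-attested collocation, whereas the log-likelihood ratio (Dunning's G² over the 2x2 contingency table of a word pair) grows with the evidence:

```python
import math

def pmi(n_ab, n_a, n_b, n):
    """Pointwise mutual information: log2( P(a,b) / (P(a)*P(b)) )."""
    return math.log2((n_ab * n) / (n_a * n_b))

def llr(n_ab, n_a, n_b, n):
    """Log-likelihood ratio G2 = 2 * sum O * ln(O/E) over the
    2x2 contingency table of the pair (a, b) in a corpus of n tokens."""
    table = [[n_ab, n_a - n_ab],
             [n_b - n_ab, n - n_a - n_b + n_ab]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed == 0:
                continue
            expected = (table[i][0] + table[i][1]) * (table[0][j] + table[1][j]) / n
            g2 += 2 * observed * math.log(observed / expected)
    return g2

# Hypothetical counts: a frequent collocation vs. a hapax pair.
n = 100_000
frequent = (50, 100, 100, n)   # pair seen 50 times
rare = (1, 1, 1, n)            # pair seen once

print(pmi(*rare) > pmi(*frequent))   # PMI prefers the hapax pair
print(llr(*frequent) > llr(*rare))   # LLR prefers the well-attested pair
```

This behavior is what makes a frequency-sensitive measure such as the likelihood ratio better suited to term extraction here: a candidate term must accumulate evidence in the corpus, not merely co-occur unexpectedly once.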