A Contrastive Approach to Term Extraction

Many approaches to corpus-driven terminology extraction are based on symbolic (i.e. purely syntactic), statistical, or hybrid models (Jacquemin, 1997). Daille (1994) compared different statistical measures for selecting terminological expressions among the candidates observed in a source corpus; simple frequency is suggested as the most effective for the task. However, it is still far from being a satisfactory discriminating function. The wide evidence collected by previous studies suggests that term detection should make use of more information than the observable distributional behavior of candidate terms. Better models should be derived from different sample spaces rather than from further refinement of probabilistic measures within the target domain. Traditionally, all the suggested measures relate to a single target domain, from which the distributional information is derived. This paper proposes a contrastive approach to statistical term extraction, based on selection/filtering criteria that capitalize on differences among domains. The method relies on a grammatical candidate extraction component and a cross-domain statistical measure as the term selection model. Experiments over the target domain, evaluated against a reference terminological database, show that the proposed method improves on simple frequency.