Automatic Acquisition of Terminological Resources for Information Extraction Applications

In this paper we present a method for (semi-)automating the elicitation of domain-specific terminological resources within the framework of information extraction applications. The method linguistically processes machine-readable text corpora and extracts lists of candidate multi-word terms of the domain, which are then validated by domain experts. It proceeds in three pipelined stages: a) morphosyntactic annotation of the domain corpus, b) corpus parsing based on a pattern grammar endowed with regular expressions and feature-structure unification, and c) lemmatisation. Candidate terms are then statistically evaluated in order to single out valid domain terms and to reduce the overgeneration caused by pattern grammars. This hybrid methodology was tested on a software manual corpus, where it achieved 62% recall. Of the 10 statistical filters applied, to two-word terms only, the best-performing one further confirmed 30% of the index two-word terms and also reduced the proposed list to 1/15 of its size.
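To make the pipeline concrete, the following is a minimal sketch of the three stages followed by one statistical filter, written for illustration only. Everything in it is an assumption rather than the system described here: the toy corpus and its one-letter tagset, the lookup table standing in for a lemmatiser, the names `extract_candidates` and `score_candidates`, and the single "(adjective or noun) + noun" pattern. Feature-structure unification is omitted, and pointwise mutual information is used only as a familiar example of an association score; the abstract does not name the 10 filters that were tested.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, assumed to be already POS-tagged (stage a).
# One-letter tags: D determiner, A adjective, N noun, V verb.
TAGGED_CORPUS = [
    [("the", "D"), ("hard", "A"), ("disks", "N"), ("store", "V"),
     ("the", "D"), ("system", "N"), ("files", "N")],
    [("a", "D"), ("hard", "A"), ("disk", "N"), ("needs", "V"),
     ("a", "D"), ("new", "A"), ("driver", "N")],
    [("the", "D"), ("new", "A"), ("system", "N"), ("reads", "V"),
     ("the", "D"), ("file", "N")],
]

# Toy lemma lookup standing in for a real lemmatiser (stage c).
LEMMAS = {"disks": "disk", "files": "file"}

# Illustrative two-word pattern over the tag string (stage b):
# an adjective or noun followed by a noun. The lookahead matches
# without consuming, so overlapping candidates are all found.
TWO_WORD_TERM = re.compile(r"(?=[AN]N)")


def lemmatise(word):
    """Return the lemma of a word, falling back to its lowercase form."""
    word = word.lower()
    return LEMMAS.get(word, word)


def extract_candidates(sentence):
    """Yield lemmatised two-word candidates from one tagged sentence."""
    # Tags are one character each, so string offsets equal token offsets.
    tags = "".join(tag for _, tag in sentence)
    for match in TWO_WORD_TERM.finditer(tags):
        i = match.start()
        yield tuple(lemmatise(word) for word, _ in sentence[i:i + 2])


def score_candidates(corpus):
    """Return (PMI score per candidate, frequency per candidate)."""
    unigrams = Counter(lemmatise(w) for sent in corpus for w, _ in sent)
    pairs = Counter(c for sent in corpus for c in extract_candidates(sent))
    n = sum(unigrams.values())  # corpus size in tokens
    # PMI = log2(P(x, y) / (P(x) * P(y))), with every probability
    # estimated as count / n (a simplification for this sketch).
    scores = {
        (x, y): math.log2(count * n / (unigrams[x] * unigrams[y]))
        for (x, y), count in pairs.items()
    }
    return scores, pairs


if __name__ == "__main__":
    scores, pairs = score_candidates(TAGGED_CORPUS)
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(" ".join(term), pairs[term], round(score, 2))
```

On this toy corpus the pattern alone proposes four candidates, including weak ones such as "new system", which illustrates the overgeneration that the statistical stage is meant to reduce. A filter would then keep only candidates above some score or frequency threshold before passing the list to domain experts.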