Rule-based Automatic Multi-word Term Extraction and Lemmatization

In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources: electronic dictionaries and finite-state transducers that model the various syntactic structures of multi-word terms. The same technology is used to lemmatize the extracted multi-word terms, a step that is necessary for highly inflected languages before the extracted data can be passed to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered to reject incorrectly proposed lemmas and then ranked using measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for the retrieval of multi-word unit (MWU) forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were judged to be proper multi-word units, and that 97% of these were associated with correct lemmas.
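As an illustration of one of the ranking measures named above, the following is a minimal sketch of C-Value in its commonly cited form: the candidate's frequency is reduced by the mean frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the candidate's length in words. The abstract does not state which variant the authors use, so the function names (`contains`, `c_value`), the tuple-of-words representation, and the toy frequencies are assumptions made only for this example.

```python
import math
from typing import Dict, Tuple

Term = Tuple[str, ...]  # a candidate term as a tuple of (lemmatized) words


def contains(longer: Term, shorter: Term) -> bool:
    """True if `shorter` occurs as a contiguous sub-sequence of a strictly longer term."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq: Dict[Term, int]) -> Dict[Term, float]:
    """Score each candidate with the commonly cited C-Value formula (assumed variant)."""
    scores: Dict[Term, float] = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates in which `a` is nested.
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            adjusted = f_a - sum(nesting) / len(nesting)
        else:
            adjusted = float(f_a)
        # Candidates are multi-word, so len(a) >= 2 and the log weight is >= 1.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy, made-up frequencies; not data from the paper.
toy_freq = {
    ("open", "pit"): 10,
    ("open", "pit", "mine"): 4,
    ("coal", "seam"): 6,
}
toy_scores = c_value(toy_freq)
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy input, the nested candidate `("open", "pit")` is penalized for the four occurrences that belong to the longer candidate `("open", "pit", "mine")`, while the longer candidate benefits from the length weight.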
