Semi-automatic extraction of multiword terms from domain-specific corpora

Purpose: This paper presents a hybrid approach that combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.

Design/methodology/approach: The method is designed to be domain- and language-independent, with a focus on morphologically rich languages. As a use case, it is applied to extracting multiword terms from Serbian texts in the agricultural engineering domain. Multiword terms are assumed to follow predefined syntactic structures. For each structure, a finite-state transducer was developed that recognizes text sequences having that structure and outputs each sequence in a normalized form, so that different inflectional forms of the same multiword term are counted together. Term candidates were then filtered by frequency and evaluated by two domain experts.

Findings: Using language resources such as electronic dictionaries and grammars, 928 multiword terms were extracted from the 1,523 candidates recognized in a corpus containing 42,260 different simple word forms. Of these, 870 were new, that is, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.

Originality/value: The paper presents a methodology that can contribute significantly to the development of terminology lexicons in different areas. In this use case, some important agricultural engineering concepts were extracted from the text, but the approach could be used for other domains and languages as well.
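The recognize, normalize, count and filter steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the finite-state transducers are replaced here by simple matching of part-of-speech sequences, normalization is approximated by joining lemmas, and the token format, the tag set, the `PATTERNS` list, the function names and the frequency threshold are all assumptions made for illustration. For a morphologically rich language such as Serbian, proper normalization would also have to restore agreement between the components, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical token format: (surface form, lemma, part-of-speech tag).
# Hypothetical syntactic structures, written as sequences of POS tags.
PATTERNS = [("A", "N"), ("N", "N"), ("A", "A", "N")]


def extract_candidates(tokens, patterns=PATTERNS):
    """Count normalized multiword candidates that match any POS pattern.

    Normalization is approximated by joining the lemmas of the matched
    tokens, so different inflected forms of one term share a single key.
    """
    counts = Counter()
    for start in range(len(tokens)):
        for pattern in patterns:
            window = tokens[start:start + len(pattern)]
            if len(window) != len(pattern):
                continue  # pattern would run past the end of the text
            if all(tok[2] == tag for tok, tag in zip(window, pattern)):
                normalized = " ".join(tok[1] for tok in window)
                counts[normalized] += 1
    return counts


def filter_by_frequency(counts, min_freq=2):
    """Keep only candidates seen at least min_freq times (assumed threshold)."""
    return {term: n for term, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    # Toy English input; "systems" and "system" share the lemma "system".
    sample = [
        ("irrigation", "irrigation", "N"),
        ("systems", "system", "N"),
        ("use", "use", "V"),
        ("drip", "drip", "N"),
        ("irrigation", "irrigation", "N"),
        ("system", "system", "N"),
    ]
    candidates = extract_candidates(sample)
    print(filter_by_frequency(candidates))
```

In this toy input, the two inflected occurrences are counted under one normalized key, which is the effect the normalized transducer output is meant to achieve. The surviving candidates would still go to domain experts for evaluation, as in the described method.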
