Good Examples for Terminology Databases in Translation Industry

This paper addresses the problem of finding good examples for terminology database entries in the translation industry. When terms are extracted from bilingual translation memory exchange files, it is easy to also extract example sentences that show how a term is used in practice. However, there are usually many sentences containing the term, and selecting an appropriate example is not a straightforward task. In this paper, we explore the use of data mining techniques to find good term examples. After constructing a corpus from a large English-Slovenian bilingual file from the financial domain, we extract linguistic features and load them into the Weka data mining environment to analyze the performance of various classifiers. The result is a precision of 0.8 for the positive class (good examples) and an overall accuracy of 0.85. While the model was tested on only one language combination, most of the features are language-independent, which suggests that the model could be used successfully for other language combinations.
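As a rough illustration of the pipeline described above, the following minimal sketch computes a few surface features for a candidate example sentence and then the two reported metrics, precision for the positive class and overall accuracy. The three features, the class and function names, and the sample data are assumptions made for illustration only; the abstract does not list the actual features, and the paper's classifiers were run in Weka rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class ExampleFeatures:
    # Hypothetical, language-independent surface features (not from the paper).
    token_count: int      # sentence length in whitespace-separated tokens
    term_position: float  # relative character offset of the term, -1.0 if absent
    digit_ratio: float    # share of tokens that contain at least one digit


def extract_features(sentence: str, term: str) -> ExampleFeatures:
    """Compute simple surface features for one candidate example sentence."""
    tokens = sentence.split()
    idx = sentence.lower().find(term.lower())
    position = idx / len(sentence) if idx >= 0 and sentence else -1.0
    with_digits = sum(any(ch.isdigit() for ch in tok) for tok in tokens)
    ratio = with_digits / len(tokens) if tokens else 0.0
    return ExampleFeatures(len(tokens), position, ratio)


def precision_and_accuracy(gold, predicted, positive=1):
    """Return (precision for the positive class, overall accuracy)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, accuracy


if __name__ == "__main__":
    # Invented sample sentence and labels, used only to exercise the functions.
    print(extract_features("The bank issued 3 bonds", "bank"))
    gold = [1, 1, 0, 0, 1]       # 1 = good example, 0 = bad example
    predicted = [1, 0, 0, 1, 1]  # output of some classifier
    print(precision_and_accuracy(gold, predicted))
```

Precision for the positive class matters most in this setting because a sentence labelled as a good example would be shown to translators directly, so false positives are more costly than missed candidates.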
