Two approaches to compilation of bilingual multi-word terminology lists from lexical resources

Abstract: In this paper, we present two approaches, and the system that implements them, for bilingual terminology extraction. Both rely on an aligned bilingual domain corpus, a terminology extractor for the target language, and a tool for chunk alignment. The two approaches differ in how terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second uses a term extraction tool. For each approach, we performed four experiments, varying two parameters. In the experiments presented in this paper, the source language was English, the target language was Serbian, and the selected domain was Library and Information Science, for which both an aligned corpus and a bilingual terminological dictionary exist. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that the F1 score varies from 29.43% to 51.15% for the first approach and from 61.03% to 71.03% for the second. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs produced by our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, which gave an F1 score of 82.09% and an accuracy of 78.49%.
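As a reading aid for the F1 figures reported above, the following minimal sketch shows how precision, recall, and F1 could be computed for a list of extracted term pairs against a gold-standard list. It assumes exact-match comparison of (source term, target term) pairs; the abstract does not state the matching criterion actually used, and the function name `evaluate_pairs` and the English-Serbian example pairs are hypothetical illustrations, not data from the paper.

```python
def evaluate_pairs(extracted, gold):
    """Return (precision, recall, f1) for extracted term pairs.

    Assumption: a pair counts as correct only if it matches a gold
    pair exactly; the paper's own matching criterion may differ.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical illustration data (not taken from the paper).
gold_pairs = [
    ("digital library", "digitalna biblioteka"),
    ("library catalogue", "bibliotečki katalog"),
    ("information retrieval", "pronalaženje informacija"),
    ("subject heading", "predmetna odrednica"),
]
extracted_pairs = [
    ("digital library", "digitalna biblioteka"),
    ("library catalogue", "bibliotečki katalog"),
    ("information retrieval", "informacioni sistem"),  # wrong target term
]

precision, recall, f1 = evaluate_pairs(extracted_pairs, gold_pairs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Here two of the three extracted pairs are correct and two of the four gold pairs are found, so F1 is the harmonic mean of 2/3 and 1/2.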
