TermFinder: log-likelihood comparison and phrase-based statistical machine translation models for bilingual terminology extraction

Bilingual termbanks are important for many natural language processing applications, especially in industrial translation workflows. In this paper, we apply a log-likelihood comparison method to extract monolingual terminology from the source and target sides of a parallel corpus. The initial candidate term list is built from all n-gram word sequences in the corpus. A well-known statistical measure, the Dice coefficient, is then used to remove multi-word candidates with weak association from the list, after which the log-likelihood comparison method is applied to rank the phrasal candidate term list. Next, using a phrase-based statistical machine translation model, we create a bilingual terminology from the extracted monolingual term lists. We integrate an external knowledge source, the Wikipedia cross-language link databases, into the terminology extraction (TE) model to assist two processes: (a) ranking the extracted terminology list, and (b) selecting appropriate target terms for a source term. First, we report the performance of our monolingual TE model compared to a number of state-of-the-art TE models on English-to-Turkish and English-to-Hindi data sets. Then, we evaluate our novel bilingual TE model on an English-to-Turkish data set and report the automatic evaluation results. We also manually evaluate our novel TE model on English-to-Spanish and English-to-Hindi data sets, and observe excellent performance for all domains.
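
The monolingual steps described above (n-gram candidates, Dice filtering, log-likelihood ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the Dice score is shown for bigrams only, and the log-likelihood comparison is assumed here to be the standard frequency-profiling form that contrasts a domain corpus with a general reference corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dice(bigram, unigram_counts, bigram_counts):
    """Dice coefficient of a bigram (x, y): 2 * f(x, y) / (f(x) + f(y))."""
    x, y = bigram
    return 2.0 * bigram_counts[bigram] / (unigram_counts[x] + unigram_counts[y])

def log_likelihood(a, b, c, d):
    """Log-likelihood comparison score (assumed frequency-profiling form).

    a, b: frequency of the candidate in the domain / reference corpus
    c, d: total size of the domain / reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the domain corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll

def rank_terms(domain_counts, reference_counts):
    """Rank candidates by log-likelihood score, highest first."""
    c = sum(domain_counts.values())
    d = sum(reference_counts.values())
    scored = [
        (term, log_likelihood(freq, reference_counts.get(term, 0), c, d))
        for term, freq in domain_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage: count candidates, then score one bigram with Dice.
tokens = "machine translation is machine translation".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
mt_dice = dice(("machine", "translation"), unigram_counts, bigram_counts)
```

In a fuller pipeline, candidates whose Dice score falls below a chosen threshold would be dropped before ranking. Note that this log-likelihood score is symmetric: a candidate that is unusually rare in the domain corpus also scores high, so a practical ranker would typically keep only candidates whose relative frequency is higher in the domain corpus than in the reference corpus.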
