Cognate Production using Character-based Machine Translation

Cognates are words in different languages that language learners associate with each other. This makes them important indicators for predicting the perceived difficulty of a text. We introduce a method for automatic cognate production using character-based machine translation. We show that our approach is able to learn production patterns from noisy training data and that it works for a wide range of language pairs. It even works across different alphabets: we obtain good results on the tested language pairs English-Russian, English-Greek, and English-Farsi. Our method performs significantly better than the similarity measures used in previous work on cognates.
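As a rough illustration of what "character-based" means here, the sketch below turns word pairs into space-separated character sequences, so that a standard phrase-based translation toolkit would treat each character as a token. This is only a minimal sketch under stated assumptions: the helper names (`to_char_tokens`, `from_char_tokens`, `make_parallel_corpus`), the lowercasing step, and the two English-Russian example pairs are hypothetical and are not taken from the paper's data or preprocessing.

```python
from typing import Iterable, List, Tuple


def to_char_tokens(word: str) -> str:
    """Split a word into space-separated characters (hypothetical helper).

    Lowercasing is an assumption of this sketch, not a documented step.
    """
    return " ".join(word.strip().lower())


def from_char_tokens(line: str) -> str:
    """Join a space-separated character sequence back into a word."""
    return "".join(line.split())


def make_parallel_corpus(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Build line-aligned source and target sides from (source, target) pairs."""
    source_lines: List[str] = []
    target_lines: List[str] = []
    for source_word, target_word in pairs:
        source_lines.append(to_char_tokens(source_word))
        target_lines.append(to_char_tokens(target_word))
    return source_lines, target_lines


# Illustrative pairs only; not drawn from the paper's training data.
example_pairs = [("nation", "нация"), ("station", "станция")]
source_side, target_side = make_parallel_corpus(example_pairs)

for src, tgt in zip(source_side, target_side):
    print(src, "|||", tgt)
```

Each printed line pairs a character-tokenized source word with its character-tokenized counterpart. A system trained on such data would emit character sequences, which `from_char_tokens` joins back into a candidate cognate.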
