Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

Cognates are variants of the same lexical form across different languages; for example, “fonema” in Spanish and “phoneme” in English are cognates, both meaning “a unit of sound”. Automatically detecting cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our cognate detection methods on a challenging dataset of twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points in F-score for cognate detection. Furthermore, cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also publicly release our code, newly constructed datasets, and cross-lingual models.
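As a rough illustration of the kind of features such a system might combine, the sketch below pairs an orthographic similarity score with the cosine similarity of two word vectors. This is not the method used in the paper: the function names, the choice of Levenshtein distance, and the toy two-dimensional vectors are all assumptions made for illustration, and real cross-lingual embeddings would come from a trained model.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cognate_features(word_a, word_b, vec_a, vec_b):
    """Hypothetical feature vector for a candidate pair:
    [orthographic similarity, embedding similarity]."""
    return [normalized_edit_similarity(word_a, word_b),
            cosine_similarity(vec_a, vec_b)]


# Toy vectors stand in for cross-lingual embeddings (assumed, not real data).
features = cognate_features("fonema", "phoneme", [1.0, 0.0], [1.0, 0.0])
```

A classifier trained on labelled pairs could then take such feature vectors as input; which features and which classifier work best is an empirical question that this sketch does not address.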
