Cross-Lingual Word Embeddings for Turkic Languages

Interest has been growing in learning cross-lingual word embeddings to transfer knowledge from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many others. In this paper, we present the first viability study of established techniques for aligning monolingual embedding spaces for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family, which is heavily affected by the low-resource constraint. These techniques are known to require little explicit supervision, mainly in the form of bilingual dictionaries, and are therefore easily adaptable to different domains, including low-resource ones. We obtain new bilingual dictionaries and new word embeddings for these languages and show the steps for obtaining cross-lingual word embeddings using state-of-the-art techniques. We then evaluate the results on the bilingual dictionary induction task. Our experiments confirm that the obtained bilingual dictionaries outperform previously available ones, and that word embeddings for a low-resource language can benefit from resource-rich, closely related languages when they are aligned together. Furthermore, evaluation on an extrinsic task (sentiment analysis on Uzbek) indicates that monolingual word embeddings can benefit, albeit slightly, from cross-lingual alignment.
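To make the alignment and evaluation steps concrete, the following is a minimal sketch of one standard supervised mapping, orthogonal Procrustes, followed by a precision-at-1 score for bilingual dictionary induction. It is an illustration under stated assumptions and not necessarily the method used in this paper: the embeddings are synthetic random vectors, the "dictionary" is the row-by-row pairing of the two matrices, and the function names `procrustes_align` and `precision_at_1` are hypothetical.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays whose rows are the embeddings of the
    source and target words of each dictionary entry.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def precision_at_1(src_vecs, tgt_vecs, gold):
    """Share of source words whose cosine nearest neighbour is the gold target.

    gold[i] is the row index in tgt_vecs of the correct translation of
    source word i.
    """
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    pred = (s @ t.T).argmax(axis=1)
    return float((pred == gold).mean())


# Synthetic stand-in for two monolingual spaces (assumption: real embeddings
# would be loaded from disk). The source space is a rotated, slightly noisy
# copy of the target space, so row i of src translates to row i of tgt.
rng = np.random.default_rng(0)
n_words, dim = 200, 20
tgt = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ rotation.T + 0.01 * rng.normal(size=(n_words, dim))

# Use the first 150 pairs as the training dictionary, the rest as test pairs.
train = np.arange(150)
test = np.arange(150, n_words)

W = procrustes_align(src[train], tgt[train])

# Retrieve translations for held-out source words among all target words.
p1_before = precision_at_1(src[test], tgt, gold=test)
p1_after = precision_at_1(src[test] @ W, tgt, gold=test)
print(f"P@1 before alignment: {p1_before:.2f}, after: {p1_after:.2f}")
```

Restricting the mapping to an orthogonal matrix preserves distances and angles within the source space, so monolingual similarity structure is unchanged by the alignment. Plain nearest-neighbour retrieval is used here for brevity; hubness-aware retrieval criteria are a common refinement.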
