Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings

To enrich the vocabulary of low-resource languages, we propose a novel method that identifies loanwords in monolingual corpora. More specifically, we first use cross-lingual word embeddings as the core feature to generate semantically related candidates based on comparable corpora and a small bilingual lexicon; then, a log-linear model that combines several shallow features, such as pronunciation similarity and hybrid language model features, is used to predict the final results. In this paper, we use Uyghur as the recipient language and try to detect loanwords from four donor languages: Arabic, Chinese, Persian and Russian. We conduct two groups of experiments to evaluate the effectiveness of our proposed approach: loanword identification and OOV translation in four language pairs and eight translation directions (Uyghur-Arabic, Arabic-Uyghur, Uyghur-Chinese, Chinese-Uyghur, Uyghur-Persian, Persian-Uyghur, Uyghur-Russian, and Russian-Uyghur). Experimental results on loanword identification show that our method significantly outperforms the baseline models. Neural machine translation models that integrate the loanword identification results achieve the best results on OOV translation (with 0.5-0.9 BLEU improvements).
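The two-stage pipeline described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the feature weights, the toy vectors, and the romanized toy strings are all assumptions made for the example. Pronunciation similarity is approximated here by a normalized edit distance, and the hybrid language model feature is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in a shared cross-lingual space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pronunciation_similarity(a, b):
    """Assumed stand-in for pronunciation similarity: 1 - normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def top_candidates(word_vec, donor_vectors, k=2):
    """Stage 1: keep the k donor words closest to the query in embedding space."""
    scored = [(cosine(word_vec, vec), w) for w, vec in donor_vectors.items()]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

def loglinear_probs(feature_rows, weights):
    """Stage 2: softmax over weighted feature sums (a log-linear model)."""
    scores = [sum(weights[name] * value for name, value in row.items())
              for row in feature_rows]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def identify_donor(word, word_vec, donor_vectors, weights, k=2):
    """Return the most probable donor word and its probability."""
    candidates = top_candidates(word_vec, donor_vectors, k)
    rows = [{"embedding": cosine(word_vec, donor_vectors[c]),
             "pronunciation": pronunciation_similarity(word, c)}
            for c in candidates]
    probs = loglinear_probs(rows, weights)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]

# Toy data (hypothetical): two-dimensional vectors and invented weights.
donor_vectors = {
    "kompyuter": [0.9, 0.1],
    "mashina": [0.8, 0.3],
    "kniga": [0.1, 0.9],
}
weights = {"embedding": 1.0, "pronunciation": 2.0}

print(identify_donor("kompyuter", [1.0, 0.2], donor_vectors, weights))
```

In a realistic setting the weights would be learned from annotated loanword pairs rather than set by hand, and the candidate set would come from embeddings trained on comparable corpora with a seed lexicon, as the abstract describes.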
