“A Passage to India”: Pre-trained Word Embeddings for Indian Languages

Dense word vectors, or ‘word embeddings’, which encode semantic properties of words, have become integral to NLP tasks such as Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use several existing approaches to create multiple word embeddings for 14 Indian languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, and Telugu, and place them all in a single repository. Relatively newer approaches that model context (BERT, ELMo, etc.) have shown significant improvements, but require large amounts of data and compute to produce usable models. We release pre-trained embeddings generated with both contextual and non-contextual approaches, and we additionally use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To demonstrate the efficacy of our embeddings, we evaluate them on the XPOS, UPOS, and NER tasks for all these languages. In total, we release 436 models built using 8 different approaches, and we hope they prove useful for resource-constrained Indian-language NLP. The title of this paper alludes to E. M. Forster's famous novel “A Passage to India”, first published in 1924.
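To make the non-contextual pipeline concrete, the following is a minimal sketch of training subword-aware skip-gram embeddings with Gensim (the toolkit cited in [15]). It is an illustration under stated assumptions, not the paper's exact recipe: the corpus file name, the choice of Hindi, and all hyperparameter values are hypothetical placeholders.

# A minimal sketch, assuming Gensim 4.x; the corpus path and the
# hyperparameters are hypothetical, not the paper's actual settings.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Tokenized monolingual corpus, one sentence per line (hypothetical file).
sentences = LineSentence("hi_monolingual.txt")

model = FastText(
    sentences,
    vector_size=300,  # embedding dimensionality
    window=5,         # context window size
    min_count=5,      # ignore very rare words
    sg=1,             # skip-gram objective, as in word2vec [8]
    epochs=10,
)

model.save("hi_fasttext.model")
# Character n-grams let fastText produce vectors even for unseen words:
print(model.wv.most_similar("भारत", topn=5))

Cross-lingual spaces for a language pair can then be obtained by aligning two such monolingual models, which is the role MUSE plays in the pipeline described above.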

[1] Eneko Agirre et al. Unsupervised Statistical Machine Translation. EMNLP, 2018.

[2] Pushpak Bhattacharyya et al. Improving NER Tagging Performance in Low-Resource Languages via Multilingual Learning. ACM Transactions on Asian and Low-Resource Language Information Processing, 2018.

[3] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[4] Samar Haider et al. Urdu Word Embeddings. LREC, 2018.

[5] Sampo Pyysalo et al. Universal Dependencies v1: A Multilingual Treebank Collection. LREC, 2016.

[6] Quoc V. Le et al. Distributed Representations of Sentences and Documents. ICML, 2014.

[7] Ido Dagan et al. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. CoNLL, 2016.

[8] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[9] Eneko Agirre et al. An Effective Approach to Unsupervised Machine Translation. ACL, 2019.

[10] Ondrej Bojar et al. HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation. LREC, 2014.

[11] Eneko Agirre et al. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. ACL, 2018.

[12] Omer Levy et al. Zero-Shot Relation Extraction via Reading Comprehension. CoNLL, 2017.

[13] Narayan Choudhary et al. Creating Multilingual Parallel Corpora in Indian Languages. LTC, 2011.

[14] Matteo Pagliardini et al. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. NAACL, 2018.

[15] Petr Sojka et al. Software Framework for Topic Modelling with Large Corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.

[16] Roland Vollgraf et al. Contextual String Embeddings for Sequence Labeling. COLING, 2018.

[17] Nan Hua et al. Universal Sentence Encoder. arXiv, 2018.

[18] Paola Merlo et al. Cross-Lingual Word Embeddings and the Structure of the Human Bilingual Lexicon. CoNLL, 2019.

[19] Guillaume Lample et al. Word Translation Without Parallel Data. ICLR, 2018.

[20] Nick Craswell et al. Query Expansion with Locally-Trained Word Embeddings. ACL, 2016.

[21] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS, 2019.

[22] Jeffrey Pennington et al. GloVe: Global Vectors for Word Representation. EMNLP, 2014.

[23] Yoshua Bengio et al. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003.

[24] Luke S. Zettlemoyer et al. Deep Contextualized Word Representations. NAACL, 2018.

[25] Tomas Mikolov et al. Enriching Word Vectors with Subword Information. TACL, 2017.

[26] Guillaume Lample et al. Cross-lingual Language Model Pretraining. NeurIPS, 2019.

[27] Jason Weston et al. Question Answering with Subgraph Embeddings. EMNLP, 2014.

[28] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.