Learning Word Vectors for 157 Languages

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora and then use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets, for French, Hindi, and Polish, to evaluate these word vectors. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
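As a rough illustration of the word-analogy evaluation mentioned above, the following sketch queries pre-trained vectors in the standard word2vec text format using the gensim library; the file name cc.fr.300.vec and the specific query words are illustrative assumptions, not the authors' exact evaluation pipeline.

```python
# Minimal sketch of a word-analogy query against pre-trained word vectors.
# Assumes vectors stored in word2vec text format; the file name
# "cc.fr.300.vec" and the query words are illustrative only.
from gensim.models import KeyedVectors

# Load the pre-trained vectors (can be slow for large files).
vectors = KeyedVectors.load_word2vec_format("cc.fr.300.vec", binary=False)

# Solve the analogy "roi - homme + femme ≈ ?" (king - man + woman ≈ queen).
result = vectors.most_similar(positive=["roi", "femme"], negative=["homme"], topn=1)
print(result)  # e.g. [("reine", 0.7...)]
```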
