Evaluation Of Word Embeddings From Large-Scale French Web Content

Distributed word representations are widely used in natural language processing, and word vectors pre-trained on huge text corpora have achieved strong performance across many NLP tasks. This paper introduces several high-quality word vectors for the French language: two are trained on large-scale crawled French data, and the others on an already existing French corpus. We evaluate the quality of our proposed word vectors, as well as that of existing French word vectors, on the French word analogy task. In addition, we evaluate them on several real NLP tasks, showing the substantial performance gains of the pre-trained word vectors over existing and randomly initialized ones. Finally, we created a demo web application to test and visualize the obtained word embeddings1. The produced French word embeddings are publicly available, along with the fine-tuning code for the NLU tasks2 and the demo code3.
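The word analogy evaluation mentioned above can be sketched with a minimal 3CosAdd solver. This is an illustrative toy, not the paper's actual evaluation code: the vocabulary, the 2-dimensional vectors, and the French word choices below are invented for the example, whereas real evaluations use the paper's 300-dimensional embeddings and a full analogy dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' with 3CosAdd: argmax_d cos(d, b - a + c),
    excluding the three query words from the candidate set."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Toy 2-d embeddings (hypothetical values for illustration only).
toy = {
    "roi":   [1.0, 1.0],    # king
    "homme": [1.0, 0.0],    # man
    "femme": [0.0, 0.1],    # woman
    "reine": [0.0, 1.1],    # queen
    "paris": [0.5, -0.5],   # distractor
}
print(analogy(toy, "homme", "roi", "femme"))  # prints "reine"
```

A correct answer counts as a hit; accuracy over all analogy questions is the reported metric. In practice the same query runs over the full vocabulary of the trained embeddings rather than a hand-picked candidate list.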
