Persian Word Embedding Evaluation Benchmarks

Recently, there has been renewed interest in semantic word representations, also called word embeddings, for a wide variety of natural language processing tasks that require sophisticated semantic and syntactic information. The quality of word embedding methods is usually evaluated on English-language benchmarks; only a few studies analyze word embeddings for low-resource languages such as Persian. In this paper, we perform an extensive word embedding evaluation for the Persian language on a set of lexical semantics tasks: analogy, concept categorization, and word semantic relatedness. For these evaluation tasks, we provide three benchmark data sets and use them to expose the strengths and weaknesses of five well-known embedding models trained on the Wikipedia corpus. The experimental results indicate that fastText (skip-gram) and Word2Vec (CBOW) outperform the other models.
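To make the evaluation setup concrete, the sketch below shows one plausible way to train the two best-performing configurations and score them on the analogy and relatedness tasks using gensim and scipy. This is a minimal illustration, not the authors' exact pipeline: the corpus file and the gold-standard relatedness file are hypothetical placeholders, and hyperparameters are common defaults rather than values reported in the paper.

```python
# Minimal sketch of training and evaluating word embeddings with gensim.
# File names are hypothetical placeholders; the paper's Persian benchmark
# data sets are not reproduced here.
from gensim.models import Word2Vec, FastText
from scipy.stats import spearmanr

# Hypothetical pre-tokenized Persian Wikipedia corpus, one sentence per line.
with open("fa_wiki_tokenized.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# The two configurations the abstract reports as strongest:
# Word2Vec with CBOW (sg=0) and fastText with skip-gram (sg=1).
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=0)
ft = FastText(sentences, vector_size=300, window=5, min_count=5, sg=1)

# Analogy task via 3CosAdd: "a is to b as c is to ?" answered by
# argmax cos(b - a + c, x). Classic king - man + woman example, in Persian.
print(w2v.wv.most_similar(positive=["زن", "پادشاه"], negative=["مرد"], topn=1))

# Word semantic relatedness: Spearman correlation between model cosine
# similarities and human judgments from a hypothetical gold-standard TSV
# with lines of the form "word1<TAB>word2<TAB>score".
gold, pred = [], []
with open("fa_relatedness_gold.tsv", encoding="utf-8") as f:
    for line in f:
        w1, w2, score = line.rstrip("\n").split("\t")
        if w1 in ft.wv and w2 in ft.wv:  # skip out-of-vocabulary pairs
            gold.append(float(score))
            pred.append(ft.wv.similarity(w1, w2))
print("Spearman rho:", spearmanr(gold, pred).correlation)
```

A higher Spearman correlation means the model's similarity ranking agrees more closely with human judgments, which is the standard way such relatedness benchmarks are scored.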
