SAT Based Analogy Evaluation Framework for Persian Word Embeddings

In recent years, word embeddings have attracted considerable interest as an approach to converting words into vectors. A central question is how much of the semantics of the words is carried over into the embedding vectors. This matters because embeddings serve as the basis for downstream NLP applications, and evaluating each application end-to-end to judge the quality of the underlying embedding model is costly. Word embeddings are generally evaluated through a number of tests, including the analogy test. Persian is a low-resource language, and no rich semantic benchmark exists for evaluating word embedding models in this language. In this paper we propose an evaluation framework for Persian embedding models, comprising a handcrafted Persian SAT-based analogy dataset, a colloquial test set (specific to Persian), and a benchmark for studying the impact of various parameters on the semantic evaluation task.
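To make the idea of a SAT-style analogy test concrete, the following is a minimal sketch of how one multiple-choice question might be scored against an embedding model. The abstract does not specify the scoring rule, so the vector-offset cosine comparison used here is an assumption, and the function names (`answer_sat_question`, `offset`, `cosine`) and the tiny `TOY` vectors are invented purely for illustration; they are not taken from the dataset or from any real model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def offset(emb, pair):
    """Difference vector emb[b] - emb[a] for a word pair (a, b)."""
    a, b = pair
    return [y - x for x, y in zip(emb[a], emb[b])]


def answer_sat_question(emb, stem, choices):
    """Return the index of the choice pair whose offset is most similar
    to the stem pair's offset (assumed scoring rule, see lead-in)."""
    stem_off = offset(emb, stem)
    scores = [cosine(stem_off, offset(emb, c)) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# Hypothetical 2-dimensional embeddings, hand-picked for the example.
TOY = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.1], "queen": [3.0, 1.0],
    "cat": [0.0, 2.0], "dog": [0.5, 2.0],
}

stem = ("man", "woman")
choices = [("cat", "dog"), ("king", "queen"), ("woman", "man")]
best = answer_sat_question(TOY, stem, choices)
print(choices[best])
```

Under this assumed rule, a model's score on the dataset would simply be the fraction of questions for which the selected index matches the gold answer.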
