On Finding Similar Verses from the Holy Quran using Word Embeddings

Measuring semantic textual similarity (STS) between two pieces of text is a well-known problem in Natural Language Processing. It has applications in many fields, such as plagiarism detection, finding related user queries in customer service, and finding similar questions in search engines or on forums like Stack Overflow, Quora and Stack Exchange. Applied to a religious text, it can help relate how similar pieces of knowledge are described in different places. This paper uses Word2Vec and Sent2Vec models to facilitate knowledge extraction from a given corpus. The corpus consists of several English translations of the Holy Quran, the most sacred book for Muslims. Sent2Vec models are trained on these translations, and the trained models are then used to study the semantic relationships between different words and sentences. The performance of the custom-built word embeddings is compared against the pre-trained embeddings provided by the spaCy library.
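The general idea of comparing sentences through embeddings can be illustrated with a minimal sketch. It is not the paper's implementation: the three-dimensional vectors in `TOY_VECTORS` are hypothetical values chosen by hand, the input phrases are placeholders rather than actual verses, and plain averaging of word vectors stands in for a trained Sent2Vec model. The helper names `sentence_vector`, `cosine` and `most_similar` are likewise assumptions made for illustration.

```python
import math

# Hypothetical toy embeddings (hand-picked, 3 dimensions).
# A trained Word2Vec or Sent2Vec model would learn vectors with
# hundreds of dimensions from the corpus instead.
TOY_VECTORS = {
    "mercy": [0.9, 0.1, 0.0],
    "forgiveness": [0.8, 0.2, 0.1],
    "lord": [0.5, 0.5, 0.2],
    "fire": [0.0, 0.1, 0.9],
    "punishment": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words; None if no word is known."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        return None
    dim = len(next(iter(TOY_VECTORS.values())))
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates):
    """Return the candidate whose averaged vector is closest to the query's."""
    q = sentence_vector(query)
    if q is None:
        return None
    scored = []
    for text in candidates:
        v = sentence_vector(text)
        if v is not None:
            scored.append((cosine(q, v), text))
    return max(scored)[1] if scored else None


if __name__ == "__main__":
    # Placeholder phrases, not actual verses.
    print(most_similar("mercy lord", ["forgiveness lord", "fire punishment"]))
```

With a real model, the lookup table would be replaced by the learned vectors, and the same cosine ranking could then be applied across all verses of a translation.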
