Paper information - An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

Kate Gleason College of Engineering, Center for Quality and Applied Statistics. Master of Science, by Salha Hassan Muhammed Qahl.
Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically? Using the Bible and the Quran as our corpus, this research explores the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining distinctly difficult. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures (Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark) to identify similarities and differences between the sacred texts.
We started by comparing chapters of the two raw texts using the proximity measures, to visualize how the measures behave in a high-dimensional, sparse space. Some of the chapters were apparently similar, but the evidence was not conclusive, so the noise had to be cleaned using natural language processing (NLP). For example, to reduce the size of the two vectors, we built lists of vocabulary that is worded differently in the two texts but carries the same meaning. The program would therefore recognize "Lord" in the Holy Bible and "Allah" in the Quran as "God", and both "Jacob" in the Bible and "Yaqub" in the Quran as a prophet. This process was repeated many times to give relative comparisons on a variety of different words. After the raw texts had been compared, the comparison was repeated on the processed text.
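
The vocabulary-merging step and the simpler proximity measures described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `SYNONYMS` map holds only the four word pairs named in the abstract, and the function names (`tokenize`, `term_vectors`, and the three distance helpers) are hypothetical. It assumes plain term-frequency vectors over the joint vocabulary of the two passages.

```python
import math
import re
from collections import Counter

# Hypothetical synonym map; only these four pairs are named in the abstract.
SYNONYMS = {"lord": "god", "allah": "god", "jacob": "prophet", "yaqub": "prophet"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and map synonyms to a shared term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SYNONYMS.get(tok, tok) for tok in tokens]


def term_vectors(text_a, text_b):
    """Return aligned term-frequency vectors over the joint vocabulary."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """One minus cosine similarity; defined as 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


# Two made-up sentences, not quotations from either text.
x, y = term_vectors("The Lord spoke to Jacob", "Allah spoke to Yaqub")
print(manhattan(x, y), euclidean(x, y), cosine_distance(x, y))
```

Merging synonyms before counting shrinks the joint vocabulary, so the two vectors share more non-zero coordinates and the distances are less dominated by surface wording.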
The next comparison applied probabilistic topic modeling to the extracted feature matrix, projecting the topical matrix into a low-dimensional space for a denser comparison. Among the distance measures introduced to the sacred corpora, the probability-based measures, such as Kullback-Leibler and Jensen-Shannon, gave the best results. A further similarity analysis based on the Hellinger distance on the CTM also discriminated well between documents. This work started with the belief that if the Bible and the Quran intersect, the intersection will show clearly between the book of Deuteronomy and some Quranic chapters. It is now not only historically but also mathematically correct to say that the similarity between the Biblical and Quranic contexts is greater than the similarity within the holy books themselves. Furthermore, we conclude that distances based on probabilistic measures, such as the Jeffreys divergence and the Hellinger distance, are the recommended methods for unstructured sacred texts.
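
The probability-based measures named above compare documents as distributions, for example per-document topic proportions from a topic model. The sketch below gives the standard textbook definitions of the Hellinger distance, the Jeffreys (symmetric Kullback-Leibler) divergence, and the Jensen-Shannon divergence. The topic proportions are invented for illustration and are not results from the thesis, and the function names are hypothetical. The Jeffreys divergence as written assumes both distributions are strictly positive.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); needs q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jeffreys(p, q):
    """Jeffreys divergence: the symmetric sum KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by log 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented topic proportions for three hypothetical chapters.
chapter_a = [0.7, 0.2, 0.1]
chapter_b = [0.6, 0.3, 0.1]
chapter_c = [0.1, 0.2, 0.7]

for name, other in [("b", chapter_b), ("c", chapter_c)]:
    print(
        name,
        hellinger(chapter_a, other),
        jeffreys(chapter_a, other),
        jensen_shannon(chapter_a, other),
    )
```

Because topic proportions form a dense, low-dimensional distribution, these measures avoid the sparsity that affects raw term-frequency vectors. The Jensen-Shannon divergence stays finite even when one distribution has zeros, whereas the Jeffreys divergence does not.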

Salha Hassan Muhammed Qahl | S. Qahl
