An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

Kate Gleason College of Engineering, Center for Quality and Applied Statistics. Master of Science, by Salha Hassan Muhammed Qahl.

Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically? Using the Bible and the Quran as our corpus, this research explores the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining distinctly difficult. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures (Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark) to identify similarities and differences between the sacred texts. We started by comparing chapters of the two raw texts using these proximity measures, to visualize their behavior in a high-dimensional, sparse space. Some chapters appeared similar, but the result was not conclusive. The noise therefore had to be cleaned using natural language processing (NLP). For example, to reduce the size of the two vectors, we built lists of vocabulary that is worded differently in the two texts but carries the same meaning. The program would thus recognize Lord in the Holy Bible and Allah in the Quran as God, and Jacob in the Bible and Yaqub in the Quran as a prophet. This process was repeated many times to give relative comparisons across a variety of different words. After the raw texts had been compared, the comparison was repeated on the processed text.
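The vocabulary-merging step and three of the geometric measures can be sketched as follows. This is a minimal illustration and not the thesis's implementation. The synonym map holds only the four pairs named in the abstract, the shared token "prophet" for Jacob and Yaqub is an assumption about how the merge was done, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical synonym map: only these four word pairs are named in the
# abstract; the real lists were presumably much longer.
SYNONYMS = {"lord": "god", "allah": "god", "jacob": "prophet", "yaqub": "prophet"}


def tokens(text):
    """Lowercase, split on non-letters, and map synonyms to a shared term."""
    return [SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text.lower())]


def term_vectors(text_a, text_b):
    """Return aligned term-count vectors over the joint vocabulary."""
    counts_a, counts_b = Counter(tokens(text_a)), Counter(tokens(text_b))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Without the synonym map, "Lord" and "Allah" occupy separate dimensions of the joint vocabulary and contribute nothing to the overlap between two chapters. After merging they share one dimension, which both shortens the vectors and lets the measures register the agreement.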
The next comparison applied probabilistic topic modeling to the extracted feature matrix, projecting it into a low-dimensional topic space for a denser comparison. Among the distance measures applied to the sacred corpora, the probability-based measures, such as the Kullback-Leibler and Jensen-Shannon divergences, gave the best results. The Hellinger distance computed on the correlated topic model (CTM) also discriminated well between documents. This work started with the belief that if there is an intersection between the Bible and the Quran, it will show clearly between the book of Deuteronomy and some Quranic chapters. It is now not only historically but also mathematically correct to say that there is considerable similarity between the Biblical and Quranic contexts, more than the similarity within the holy books themselves. Furthermore, we conclude that distances based on probabilistic measures, such as the Jeffreys divergence and the Hellinger distance, are the recommended methods for unstructured sacred texts.
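The probability-based measures named above compare two documents through their topic proportions. The sketch below uses the standard definitions of the Hellinger distance, the Jeffreys divergence (the symmetrized Kullback-Leibler divergence), and the Jensen-Shannon divergence. It is an illustration and not the thesis's code: the function names and the smoothing constant `eps` are assumptions, and each input is assumed to be a topic-proportion vector that sums to one.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); requires q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def smooth(p, eps):
    """Add eps to every entry and renormalize (an assumed way to avoid log(0))."""
    total = sum(p) + eps * len(p)
    return [(a + eps) / total for a in p]


def jeffreys(p, q, eps=1e-12):
    """Jeffreys divergence: KL(p || q) + KL(q || p) on smoothed inputs."""
    p, q = smooth(p, eps), smooth(q, eps)
    return kl(p, q) + kl(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by log 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two chapters with identical topic proportions score zero on all three measures. Two chapters with disjoint topics reach the upper bounds of the Hellinger distance (1) and the Jensen-Shannon divergence (log 2), while the Jeffreys divergence is unbounded and grows with the amount of smoothing applied.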
