Visualizing document similarity using n-grams and latent semantic analysis

As the number of information resources and the volume of documents grow rapidly, efficient tools with intuitive visualization capabilities are urgently needed to help users carry out document similarity analysis and plagiarism detection by discovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The method models the relationship between documents and their n-gram phrases, which are generated from the normalized text using morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words within the examined documents. Text indexing and stop-word removal are performed with a new technique that deals efficiently with multiple long documents. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that takes lexical and syntactic changes into account. The hidden associations between the documents and their unique n-gram phrases are then investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. Because Arabic is one of the most morphologically rich and complex languages, this paper emphasizes the similarity analysis and visualization of Arabic documents. Various experiments were carried out, revealing the strong capabilities of the proposed method in analyzing and visualizing literal similarities and some types of intelligent similarities.
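The core of the pipeline described above, weighting n-gram phrases by TF-IDF and comparing documents pairwise, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace tokenization, word bigrams, a smoothed IDF, and plain cosine similarity, and it omits the morphological normalization, tagging, stop-word removal, heuristic matching, and the LSA/SVD step. The function names (`word_ngrams`, `tfidf_vectors`, `cosine`) and the English sample documents are hypothetical and chosen only for readability.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word n-grams of a text (assumption: whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Build one sparse TF-IDF vector (dict: n-gram -> weight) per document."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for short texts
        vectors.append({
            gram: (tf / total)
            * (math.log((1 + num_docs) / (1 + df[gram])) + 1.0)  # smoothed IDF
            for gram, tf in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents: the second is a light edit of the first.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "singular value decomposition reveals latent structure",
]

vectors = tfidf_vectors(docs, n=2)
similarity = [[cosine(a, b) for b in vectors] for a in vectors]

for row in similarity:
    print(["%.2f" % value for value in row])
```

In this sketch the first two documents share most of their bigrams and score well above zero, while the third shares none and scores zero against both. In the method described in the abstract, the document-by-n-gram matrix would additionally be decomposed with SVD so that similarity is measured in the reduced latent space rather than on raw TF-IDF weights.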
