Classification of file duplication by hierarchical clustering based on similarity relations

This paper proposes a method for classifying duplicate files by measuring the similarity score between pairs of files. The distance between each pair of files is computed with the Smith-Waterman algorithm. A Euclidean distance matrix is then used to identify the relationships between people who frequently copy files from one another. Because duplication occurs regularly, hierarchical clustering can be applied to classify how close people are to one another and to identify groups of people who are positioned closely together. The results show that the Smith-Waterman algorithm measures the similarity between files effectively. The method also analyzes the relationships between people, identifies those who are positioned closely together, and identifies the most closely related members within each group. Finally, the work reports how often each person duplicated files.
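The pipeline described above (pairwise Smith-Waterman scores, a Euclidean distance matrix, then hierarchical clustering) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the scoring values (match 2, mismatch -1, gap -1), the normalization by the shorter file's length, the choice to compare people through their rows of similarity scores, the single-linkage merge rule, and the sample file contents and names.

```python
from itertools import combinations
from math import sqrt


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Normalize the score to [0, 1] by the best score the shorter input allows."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * shortest)


def distance_matrix(files):
    """Euclidean distance between each pair of owners' similarity profiles."""
    names = sorted(files)
    profile = {p: [similarity(files[p], files[q]) for q in names] for p in names}
    dist = {
        (p, q): sqrt(sum((x - y) ** 2 for x, y in zip(profile[p], profile[q])))
        for p in names
        for q in names
    }
    return names, dist


def cluster(names, dist, k):
    """Agglomerative clustering (single linkage, assumed) down to k clusters."""
    clusters = [[n] for n in names]
    while len(clusters) > k:
        # Find the two clusters whose closest members are nearest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist[p, q] for p in clusters[ij[0]] for q in clusters[ij[1]]
            ),
        )
        clusters[i] += clusters.pop(j)  # i < j, so index i stays valid
    return clusters


# Hypothetical submissions keyed by owner name.
files = {
    "ann": "def add(a, b): return a + b",
    "bob": "def add(x, y): return x + y",
    "cat": "print('hello world')",
    "dan": "print('hello, world!')",
}
names, dist = distance_matrix(files)
groups = cluster(names, dist, k=2)
print(groups)
```

Single linkage is used here only because it is short to write; another linkage criterion, such as Ward's, could be substituted in `cluster` without changing the rest of the sketch.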
