The Comparative Analysis of Smith-Waterman Algorithm with Jaro-Winkler Algorithm for the Detection of Duplicate Health Related Records

Duplicate detection is the process of identifying pairs of words that refer to the same real-world object. Generally, words consist of letters that have a syntactic representation. Words such as names are frequently misspelt during data entry, which creates duplicate data that, if left unresolved, can lead to data inconsistency. Fundamental algorithms applied in the design of duplicate detection systems include the Smith-Waterman and Jaro-Winkler algorithms. This study compares and analyses the application of the Smith-Waterman and Jaro-Winkler algorithms to finding duplicate words in large datasets such as health datasets. The basis for comparison is how accurately each algorithm detects duplicate words in a large health dataset. The contribution of this paper is the use of the transitive and symmetric properties with both the Smith-Waterman and Jaro-Winkler algorithms when a large dataset is involved in the duplicate detection process.
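
To make the two string comparators and the role of symmetry and transitivity concrete, the following is a minimal Python sketch, not the implementation used in the study. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the prefix scale `p=0.1`, the threshold, and the helper `group_duplicates` are all illustrative assumptions. In particular, reading "transitive and symmetric properties" as "compare each unordered pair once, then merge matches into connected groups" is one plausible interpretation rather than the authors' stated procedure.

```python
from itertools import combinations


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def smith_waterman(s: str, t: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score with a linear gap penalty (assumed values)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def group_duplicates(words, sim, threshold):
    """Hypothetical helper: cluster words whose similarity meets the threshold."""
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Symmetry: sim(a, b) == sim(b, a), so each unordered pair is scored once.
    for i, j in combinations(range(len(words)), 2):
        if sim(words[i], words[j]) >= threshold:
            parent[find(i)] = find(j)  # Transitivity: merge the two groups.
    groups = {}
    for i, w in enumerate(words):
        groups.setdefault(find(i), []).append(w)
    return list(groups.values())


names = ["smith", "smyth", "smythe", "jones"]
print(group_duplicates(names, jaro_winkler, threshold=0.88))
```

With this assumed threshold, "smith" and "smythe" score below 0.88 when compared directly, yet they end up in the same group because each is close enough to "smyth". That is the practical effect of applying transitivity, and scoring each unordered pair only once roughly halves the number of comparisons on a large dataset.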
