Duplicate Detection with PMC -- A Parallel Approach to Pattern Matching

Fuzzy duplicate detection is an integral part of data cleansing. It consists of finding sets of duplicate records, correctly identifying the original or most representative record in each set, and removing the rest. As Internet usage grows and data becomes ever easier to collect and make available, we gain access to more and more data. Much of this data is collected from, and entered by, humans, which introduces noise: typing mistakes, spelling discrepancies, varying schemas, abbreviations, and more. Data cleansing and approximate duplicate detection are therefore more important than ever. In fuzzy matching, records are usually compared by measuring the edit distance between them. On large data sets this leads to a prohibitive number of record comparisons, so previous solutions have sought to reduce the number of records that must be compared. This is often done by computing a key on which the records are sorted, with the intention of placing similar records near each other. This has several downsides; for example, potentially large amounts of data must be sorted and scanned several times to catch duplicates accurately. This project differs in that it presents an approach that takes advantage of a multiple instruction stream, multiple data stream (MIMD) architecture called the Pattern Matching Chip (PMC), which performs large numbers of character comparisons in parallel. This makes it possible to fuzzy-match a record against the entire data set very quickly, removing the need for the clustering and re-arranging of data that can lead to omitted duplicates (false negatives). The main goal of this paper is to test the viability of this approach to duplicate detection, examining its performance, potential, and scalability.
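The edit-distance comparison the abstract refers to is typically Levenshtein distance. As a minimal illustrative sketch (this is the standard dynamic-programming formulation, not the PMC's parallel implementation; the example records and threshold idea are illustrative assumptions):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]; we fill one row per character of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

# Two records are treated as fuzzy duplicates if their distance is
# within some threshold, e.g. edit_distance("Jon Smith", "John Smyth") == 2.
```

Computing this for every pair of records is quadratic in the number of records, which is why the sorted-key approaches mentioned above, and the PMC's parallel comparisons, matter at scale.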
