Clustering for Data Matching

A major bottleneck in data matching is the rapid deterioration of both running time and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing data records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce this comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of the potential savings in comparisons performed and their accuracy in correctly clustering similar data. Our results on a sample from a database of companies listed on the London Stock Exchange show that the clustering techniques can improve accuracy as well as save time in data matching systems.
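The abstract does not spell out how Digram clustering works, so the following is only a rough sketch of the general idea under stated assumptions: digrams are taken to be adjacent character pairs of a normalized string, similarity is measured with the Dice coefficient, and records are grouped greedily against each cluster's first member. The function names, the 0.6 threshold, and the company names are all hypothetical and are not drawn from the paper or its data set. The sketch shows how restricting comparisons to records within the same cluster can reduce the number of pairwise comparisons.

```python
def digrams(text):
    """Return the set of adjacent character pairs of a normalized string.

    Assumption: normalization lowercases and drops non-alphanumerics.
    """
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between two digram sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(records, threshold=0.6):
    """Greedy single-pass clustering (an illustrative choice, not the
    paper's algorithm): a record joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for rec in records:
        grams = digrams(rec)
        for c in clusters:
            if dice(grams, c["grams"]) >= threshold:
                c["members"].append(rec)
                break
        else:
            clusters.append({"grams": grams, "members": [rec]})
    return [c["members"] for c in clusters]


def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# Hypothetical company names, for illustration only.
names = [
    "Acme Holdings PLC",
    "Acme Holdings P.L.C.",
    "Northwind Trading Ltd",
    "Northwind Trading Limited",
    "Globex Group",
]

groups = cluster(names)
all_pairs = pair_count(len(names))
within_pairs = sum(pair_count(len(g)) for g in groups)
print(groups)
print(f"comparisons without clustering: {all_pairs}, within clusters: {within_pairs}")
```

In this toy run, detailed matching would only be needed inside each cluster rather than across every pair of records. The clustering pass itself also costs comparisons, and a poorly chosen threshold can separate true matches, which is the kind of savings-versus-accuracy trade-off the abstract says the paper evaluates.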
