Probabilistic Data Generation for Deduplication and Data Linkage

In many data mining projects the data to be analysed contains personal information, such as names and addresses. Cleaning and pre-processing such data often involves deduplication or linkage with other data, a task made difficult by the lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publishing data that contains personal information is normally impossible because of privacy and confidentiality concerns. An alternative is to use artificially created data, which has two advantages: content and error rates can be controlled, and the deduplication or linkage status of every record is known. Controlled experiments can therefore be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
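To make the idea of artificial data with a controlled error rate and known linkage status concrete, the following is a minimal sketch in Python. It is not the generator presented in the paper: the lookup lists, the `rec_id` naming scheme, the function names `corrupt` and `generate`, and the `error_prob` parameter are all hypothetical choices made for illustration. Original records are drawn at random from small value lists, and duplicates are copies of originals in which each field is corrupted by a single-character edit with a fixed probability.

```python
import random
import string

# Hypothetical lookup lists; a realistic generator would use frequency tables.
GIVEN_NAMES = ["peter", "mary", "john", "anna"]
SURNAMES = ["smith", "miller", "jones", "nguyen"]
SUBURBS = ["canberra", "sydney", "melbourne", "perth"]
FIELDS = ("given_name", "surname", "suburb")


def corrupt(value, rng):
    """Apply one random single-character edit to a non-empty string.

    The edit is an insertion, deletion, substitution or transposition.
    It can occasionally leave the value unchanged (for example when two
    identical neighbouring letters are transposed).
    """
    pos = rng.randrange(len(value))
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        return value[:pos] + letter + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and pos < len(value) - 1:
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitution, also used as the fallback when another edit does not apply.
    return value[:pos] + letter + value[pos + 1:]


def generate(num_originals, num_duplicates, error_prob, seed=0):
    """Return a list of records; rec_id encodes the true linkage status."""
    rng = random.Random(seed)  # fixed seed so experiments can be replicated
    originals = []
    for i in range(num_originals):
        originals.append({
            "rec_id": f"rec-{i}-org",
            "given_name": rng.choice(GIVEN_NAMES),
            "surname": rng.choice(SURNAMES),
            "suburb": rng.choice(SUBURBS),
        })
    duplicates = []
    for j in range(num_duplicates):
        source = rng.choice(originals)
        dup = dict(source)
        # Keep the entity number of the source record in the duplicate's id.
        dup["rec_id"] = source["rec_id"].replace("-org", f"-dup-{j}")
        for field in FIELDS:
            if rng.random() < error_prob:
                dup[field] = corrupt(dup[field], rng)
        duplicates.append(dup)
    return originals + duplicates


if __name__ == "__main__":
    for record in generate(num_originals=3, num_duplicates=2, error_prob=0.5):
        print(record)
```

Because each duplicate keeps the entity number of its source record in `rec_id`, the true match status of any record pair can be read off the identifiers, and rerunning with the same seed reproduces the same data set.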
