Optimal Stopping: A Record-Linkage Approach

Record linkage is the process of identifying whether two separate records refer to the same real-world entity when some elements of the records' identifying information (attributes) agree and others disagree. Existing record-linkage decision methodologies use the outcomes of comparing the whole set of attributes. Here, we propose an alternative scheme that assesses the attributes sequentially, allowing a decision to be made at any attribute's comparison stage, and thus before all available attributes are exhausted. The scheme we develop is optimal in that it minimizes a well-defined average cost criterion, and the corresponding optimal solution can be easily mapped into a decision tree to facilitate the record-linkage decision process. Experiments on real datasets indicate that our methodology outperforms existing approaches.
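
To make the idea of sequential assessment concrete, the following is a minimal sketch of one simple way to stop early: a Wald-style sequential likelihood-ratio test over binary attribute comparisons. It is not the cost-optimal scheme proposed here, only an illustration of deciding before all attributes are compared. The function name `sequential_link`, the per-attribute agreement probabilities `m` (among matches) and `u` (among non-matches), and the thresholds `upper` and `lower` are all assumptions introduced for this example.

```python
import math


def sequential_link(agreements, m, u, upper, lower):
    """Compare attributes one at a time and stop as soon as the evidence suffices.

    agreements: booleans, in comparison order (True = attribute values agree).
    m[k], u[k]: assumed probability that attribute k agrees for a true match
                and for a true non-match, respectively (hypothetical inputs).
    upper, lower: assumed stopping thresholds on the log-likelihood ratio.
    Returns (decision, number_of_attributes_compared).
    """
    llr = 0.0  # running log-likelihood ratio of "match" versus "non-match"
    used = 0
    for agree, m_k, u_k in zip(agreements, m, u):
        used += 1
        if agree:
            llr += math.log(m_k / u_k)
        else:
            llr += math.log((1.0 - m_k) / (1.0 - u_k))
        if llr >= upper:
            return "match", used      # enough evidence for a link
        if llr <= lower:
            return "non-match", used  # enough evidence against a link
    # All attributes were compared without crossing either threshold.
    return "undecided", used


if __name__ == "__main__":
    m = [0.9, 0.9, 0.9]
    u = [0.1, 0.1, 0.1]
    bound = math.log(50.0)
    print(sequential_link([True, True, True], m, u, bound, -bound))
    print(sequential_link([True, False, True], m, u, bound, -bound))
```

With these illustrative values, two consecutive agreements already cross the upper threshold, so the third attribute is never compared. A cost-optimal scheme would instead derive the stopping rule from the comparison and misclassification costs rather than from fixed thresholds.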
