Optimal Stopping: A Record-Linkage Approach

Record linkage is the process of deciding whether two separate records refer to the same real-world entity when some elements of the records' identifying information (attributes) agree and others disagree. Existing record-linkage decision methodologies use the outcomes of comparing the whole set of attributes. Here, we propose an alternative scheme that assesses the attributes sequentially, allowing a decision to be made at any attribute's comparison stage, and thus before all available attributes are exhausted. The scheme we develop is optimal in that it minimizes a well-defined average cost criterion, and the corresponding optimal solution can be easily mapped into a decision tree to facilitate the record-linkage decision process. Experimental results on real datasets indicate that our methodology outperforms existing approaches.
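To make the idea of sequential assessment with early stopping concrete, the following is a minimal sketch and not the authors' algorithm. It assumes binary agree/disagree outcomes, conditional independence of attributes given the match status, a fixed comparison order, and hypothetical agreement probabilities (`M_PROB`, `U_PROB`) and costs (`COMPARE_COST`, `c_fm`, `c_fn`). Under those assumptions, backward induction compares the cost of deciding now with the expected cost of comparing one more attribute, and the result is a decision tree.

```python
# Illustrative sketch only: all numbers and names below are hypothetical.
M_PROB = (0.95, 0.90, 0.85)        # P(attribute agrees | records match)
U_PROB = (0.05, 0.20, 0.30)        # P(attribute agrees | records do not match)
COMPARE_COST = (0.02, 0.02, 0.02)  # cost of comparing each attribute


def solve(pi, k, m, u, cost, c_fm=1.0, c_fn=1.0):
    """Return (expected cost, decision tree) from stage k onward.

    pi   : current posterior probability that the two records match
    k    : index of the next attribute that could be compared
    m, u : agreement probabilities under match / non-match, strictly in (0, 1)
    cost : per-attribute comparison costs
    c_fm : cost of a false match, c_fn : cost of a false non-match
    """
    # Expected cost of stopping now with each possible decision.
    risk_match = c_fm * (1.0 - pi)
    risk_nonmatch = c_fn * pi
    stop = min(risk_match, risk_nonmatch)
    label = "match" if risk_match <= risk_nonmatch else "non-match"

    if k == len(m):  # no attributes left: we must decide
        return stop, label

    # Probability of each outcome and the Bayes-updated posterior after it.
    p_agree = pi * m[k] + (1.0 - pi) * u[k]
    p_disagree = 1.0 - p_agree
    pi_agree = pi * m[k] / p_agree
    pi_disagree = pi * (1.0 - m[k]) / p_disagree

    cost_a, tree_a = solve(pi_agree, k + 1, m, u, cost, c_fm, c_fn)
    cost_d, tree_d = solve(pi_disagree, k + 1, m, u, cost, c_fm, c_fn)
    cont = cost[k] + p_agree * cost_a + p_disagree * cost_d

    # Stop as soon as deciding now is no more costly than continuing.
    if stop <= cont:
        return stop, label
    return cont, {"compare": k, "agree": tree_a, "disagree": tree_d}


if __name__ == "__main__":
    expected_cost, tree = solve(0.5, 0, M_PROB, U_PROB, COMPARE_COST)
    print(expected_cost)
    print(tree)
```

In the returned tree, leaves are the labels `"match"` or `"non-match"`, and internal nodes name the attribute to compare next. When comparisons are expensive relative to the error costs, the tree collapses to a single leaf, which is the case of deciding without examining any attribute.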

George V. Moustakides | Vassilios S. Verykios
