Automatic Identification of Best Attributes for Indexing in Data Deduplication

We introduce an approach that selects relevant attributes for the indexing step of data deduplication, reducing overall processing time and improving deduplication effectiveness. We evaluate the proposed method on synthetic and real datasets from distinct domains. We also evaluate how the choice of indexing attributes affects the other steps of the deduplication process, and conclude that our solution is both efficient (time cost) and effective (result quality) as a whole.
