On the Use of Semantic Blocking Techniques for Data Cleansing and Integration

Record linkage (RL) is an important component of data cleansing and integration. For years, much work has focused on improving the performance of the RL process by reducing either the number of record comparisons or the number of attribute comparisons. This lowers computation time but very often degrades the quality of the results. However, the real bottleneck of RL is the post-processing step, in which experts must review the results and decide which pairs or groups of records are real links and which are false hits. In this paper, we show that exploiting the relationships (e.g., foreign keys) established within or between data sources makes it possible to define a new kind of semantic blocking method that increases the number of hits and reduces the review effort.
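
To make the idea of relationship-based blocking concrete, the following is a minimal sketch and not the method of the paper, whose details the abstract does not give. It assumes hypothetical customer records in which an `address_id` field plays the role of a foreign key, and it treats records that reference the same related entity as one block. The record fields and the function name `candidate_pairs_by_relationship` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records; "address_id" stands in for a foreign key
# pointing at a shared related entity (an assumption for illustration).
records = [
    {"id": 1, "name": "J. Smith",   "address_id": "a1"},
    {"id": 2, "name": "John Smith", "address_id": "a1"},
    {"id": 3, "name": "A. Jones",   "address_id": "a2"},
    {"id": 4, "name": "Ann Jones",  "address_id": "a2"},
    {"id": 5, "name": "J. Smith",   "address_id": "a3"},
]


def candidate_pairs_by_relationship(records, key):
    """Group records by the value of a foreign-key-like field and
    return the set of record-id pairs that share a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec["id"])

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Baseline: every record compared with every other record.
all_pairs = set(combinations(sorted(r["id"] for r in records), 2))

# Blocking on the relationship keeps only pairs that share an address.
blocked_pairs = candidate_pairs_by_relationship(records, "address_id")

print(len(all_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs with relationship-based blocking")
```

In this toy data the candidate set handed to the reviewers shrinks to the pairs that share a related entity. Whether such a reduction preserves the true links depends on the data and on the relationships chosen, which is the question the paper addresses.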
