Paper Information - Efficient Entity Resolution Based on Sequence Rules

Efficient Entity Resolution Based on Sequence Rules

Entity resolution (ER) is the task of finding data objects that refer to the same real-world entity. When ER is performed on relations, the crucial operator is record matching, which judges whether two tuples refer to the same real-world entity. Record matching is a longstanding problem, but current methods cannot meet the requirements of the massive and complex data found in applications. We present sequence-rule-based record matching (SeReMatching), which considers both the values of the attributes and their importance in record matching. With the help of a modified Bloom filter, the algorithm greatly increases checking speed and brings the complexity of entity resolution to almost O(n). Extensive experiments are performed to evaluate our methods.
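To illustrate why a Bloom filter can bring the checking cost close to O(n), the following is a minimal sketch in Python. It uses a standard Bloom filter, not the modified variant the abstract refers to, whose details are not given here. The names `BloomFilter`, `normalize`, and `candidate_duplicates`, the filter sizes, and the way a record key is built from an ordered list of attributes are all assumptions made for illustration, not the SeReMatching rules themselves.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: k bit positions per key in an m-bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def normalize(record, attributes):
    # Hypothetical key: lowercased, whitespace-collapsed values of the
    # chosen attributes, joined in the given order.
    return "|".join(" ".join(str(record[a]).lower().split()) for a in attributes)


def candidate_duplicates(records, attributes):
    """One pass over the records with a constant-time membership test each."""
    seen = BloomFilter()
    candidates = []
    for idx, record in enumerate(records):
        key = normalize(record, attributes)
        if key in seen:
            candidates.append(idx)  # possible match; still needs verification
        else:
            seen.add(key)
    return candidates


records = [
    {"name": "Ann Lee", "city": "Boston"},
    {"name": "ann  lee", "city": "BOSTON"},
    {"name": "Bob", "city": "Paris"},
]
flagged = candidate_duplicates(records, ["name", "city"])
```

Each record costs a fixed number of hash and bit operations, so the pass is linear in the number of records. Because a Bloom filter can return false positives, the flagged indices are only candidates, and a real matcher would verify them against the actual tuples.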