Boosting Blocking Performance in Entity Resolution Pipelines: Comparison Cleaning using Bloom Filters

Entity Resolution (ER) identifies different virtual representations that refer to the same real-world entity. When applied to highly heterogeneous data, ER relies on schema-agnostic blocking techniques to improve efficiency while retaining good effectiveness. A drawback of schema-agnostic blocking is the potentially high number of redundant pairwise comparisons. This has led to the introduction of additional efficiency layers beyond blocking in the overall ER pipeline, all of which aim at pruning comparisons to reduce unnecessary time overhead. This paper proposes a novel technique based on Bloom filters that integrates into such an efficiency layer. In addition to avoiding redundant comparisons, it prunes superfluous comparisons, that is, comparisons that are unlikely to result in a match when actually performed. Experiments on benchmark datasets show that our approach improves on existing approaches in space and time efficiency, with insignificant changes in effectiveness.
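To illustrate how a Bloom filter can help avoid redundant comparisons, the following is a minimal sketch and not the technique proposed in the paper. It assumes blocks are lists of string entity identifiers that do not contain the `|` character, and that a comparison is redundant when the same pair has already been emitted from an earlier block. The names `BloomFilter` and `distinct_comparisons`, as well as the filter size and number of hash functions, are hypothetical choices made for this example.

```python
import hashlib
from itertools import combinations


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive the bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def distinct_comparisons(blocks, seen=None):
    """Yield each entity pair once, skipping pairs the filter reports as seen.

    Assumption: a pair is redundant if an earlier block already produced it.
    A Bloom filter false positive would wrongly skip a pair that was never
    compared, which is the effectiveness cost of the compact representation.
    """
    seen = seen if seen is not None else BloomFilter()
    for block in blocks:
        for a, b in combinations(sorted(block), 2):
            key = f"{a}|{b}"  # canonical key: identifiers in sorted order
            if key in seen:
                continue
            seen.add(key)
            yield (a, b)


if __name__ == "__main__":
    example_blocks = [["e1", "e2", "e3"], ["e2", "e3", "e4"]]
    for pair in distinct_comparisons(example_blocks):
        print(pair)
```

The filter stores only bits rather than the pairs themselves, so memory stays fixed regardless of how many comparisons are generated. The trade-off is a small false positive rate, which grows as the filter fills up.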
