End-to-end Task Based Parallelization for Entity Resolution on Dynamic Data

Entity resolution (ER) is the problem of finding which digital representations of entities correspond to the same real-world entity. In many Big Data scenarios, data is also continuously generated, so beyond the volume and variety problems commonly addressed in ER, novel solutions are needed to address the velocity problem. This paper presents a framework for end-to-end ER that incrementally and efficiently produces results as heterogeneous data streams in. We achieve these characteristics by proposing a novel functional model for ER on incremental or streaming data and by adopting task-based parallelization. Our evaluation demonstrates that, even without parallelization, our framework outperforms state-of-the-art (batch) ER in terms of runtime and quality. We also validate that it can achieve high throughput and low latency on streaming data, paving the way to real-time ER.
