Towards task-based parallelization for entity resolution

Entity resolution (ER) is the problem of determining which virtual representations in one or more data sources refer to the same real-world entity. A central question in ER is how to find matching entity representations (so-called duplicates) efficiently and scalably. One general technique for addressing both concerns is parallelization. Almost all work on parallel ER focuses on data parallelism; this paper focuses instead on task parallelism. Task parallelism makes it possible to support incremental ER, in which the results of intermediate ER stages are streamed as soon as they are computed, so that the solution is built up incrementally. This can deliver results in a more timely fashion and suits service-oriented settings with a limited time or monetary budget. In summary, this paper presents a framework for task-parallel ER that supports, in particular, ER over large amounts of semi-structured and heterogeneous data. We also discuss a possible implementation of the framework.
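To make the idea of streaming intermediate results between ER stages concrete, the following is a minimal sketch and not the framework presented in the paper. It assumes a two-stage pipeline (blocking, then pairwise matching) in which each stage runs as its own task and hands results to the next stage through a queue. The names `blocking_stage`, `matching_stage`, and `resolve_stream`, the first-letter blocking key, and the 0.8 similarity threshold are all hypothetical choices made for illustration.

```python
import queue
import threading
from difflib import SequenceMatcher

_DONE = object()  # sentinel that marks the end of a stream


def blocking_stage(records, out_q):
    """Stage 1 (hypothetical): emit candidate pairs as soon as they arise."""
    blocks = {}
    for rid, name in records:
        key = name[:1].lower()  # toy blocking key: first letter of the name
        for earlier in blocks.setdefault(key, []):
            out_q.put((earlier, (rid, name)))  # stream the pair downstream
        blocks[key].append((rid, name))
    out_q.put(_DONE)


def matching_stage(in_q, out_q, threshold):
    """Stage 2 (hypothetical): compare each candidate pair as it arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        (id_a, name_a), (id_b, name_b) = item
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            out_q.put((id_a, id_b))


def resolve_stream(records, threshold=0.8):
    """Yield duplicate pairs incrementally while the stages are still running."""
    pairs_q = queue.Queue()
    matches_q = queue.Queue()
    workers = [
        threading.Thread(target=blocking_stage, args=(records, pairs_q)),
        threading.Thread(
            target=matching_stage, args=(pairs_q, matches_q, threshold)
        ),
    ]
    for worker in workers:
        worker.start()
    while True:
        match = matches_q.get()
        if match is _DONE:
            break
        yield match  # the consumer sees each duplicate as soon as it is found
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    sample = [(1, "Jon Smith"), (2, "John Smith"), (3, "Jane Doe"), (4, "Jon Smyth")]
    for duplicate in resolve_stream(sample):
        print(duplicate)
```

Because `resolve_stream` is a generator, a caller with a limited time or monetary budget can stop consuming at any point and keep the duplicates found so far. Threads are used here only to show the pipeline structure; in CPython they do not run CPU-bound stages in parallel, so a real deployment would likely substitute processes or distributed workers.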
