Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data

Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. Typically, it scales to large volumes of data through blocking: similar entities are clustered into blocks so that it suffices to perform comparisons only within each block. Meta-blocking further increases efficiency by cleaning the overlapping blocks from unnecessary comparisons. However, even Meta-blocking can be time-consuming: applying it to blocks with 7.4 million entities and 2.2·10^11 comparisons takes almost 8 days on a modern high-end server. In this paper, we parallelize Meta-blocking based on MapReduce. We propose a simple strategy that explicitly creates the core concept of Meta-blocking, the blocking graph. We then describe an advanced strategy that creates the blocking graph implicitly, reducing the overhead of data exchange. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the superiority of our advanced strategy and demonstrates an almost linear speedup for all meta-blocking techniques with respect to the number of available nodes.
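
To make the idea of an explicitly built blocking graph concrete, the following is a minimal single-process sketch in Python, not the MapReduce implementation described in the paper. It assumes, purely for illustration, that each edge is weighted by the number of blocks the two entities share and that edges below the mean weight are pruned; the abstract does not specify either choice. The function names and the toy block collection are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Build an explicit blocking graph from a block collection.

    Nodes are entity ids; an edge connects two entities that co-occur in at
    least one block. ASSUMPTION: the edge weight is the number of shared
    blocks (one possible weighting scheme, chosen here for illustration).
    """
    weights = defaultdict(int)
    for entities in blocks.values():
        # "Map"-like step: emit every entity pair inside the block.
        for a, b in combinations(sorted(set(entities)), 2):
            # "Reduce"-like step: aggregate the emissions per pair.
            weights[(a, b)] += 1
    return dict(weights)


def prune_by_mean_weight(weights):
    """Keep only edges whose weight reaches the mean edge weight.

    ASSUMPTION: a global mean-weight threshold, used here as one simple
    pruning criterion.
    """
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical toy input: three overlapping blocks over four entities.
blocks = {
    "b1": ["e1", "e2", "e3"],
    "b2": ["e1", "e2"],
    "b3": ["e2", "e3", "e4"],
}

graph = build_blocking_graph(blocks)
retained = prune_by_mean_weight(graph)
print(len(graph), "distinct comparisons before pruning")
print(sorted(retained), "retained after pruning")
```

In this toy input the blocks contain seven pairwise comparisons, two of them redundant; the graph collapses them into five weighted edges, and pruning keeps the two pairs that share more than one block. Materializing every edge this way is what makes the explicit strategy expensive in data exchange at scale, which is the overhead the implicit strategy is meant to reduce.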
