Load Balancing for MapReduce-based Entity Resolution

The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
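The general idea of analyzing the block distribution first and then spreading large blocks over several reduce tasks can be sketched in plain Python, without any MapReduce framework. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names, the splitting threshold (the average number of comparisons per reduce task), the even sub-block sizes, and the greedy largest-first assignment are all choices made here for the example.

```python
from collections import Counter
import heapq


def block_sizes(entities, blocking_key):
    """Preprocessing step: count how many entities fall into each block."""
    return Counter(blocking_key(e) for e in entities)


def pairs(n):
    """Number of pairwise comparisons within a block of n entities."""
    return n * (n - 1) // 2


def plan_match_tasks(sizes, num_reducers):
    """Turn blocks into match tasks, splitting blocks that are too large.

    Assumption: a block is "too large" if its comparisons exceed the
    average workload per reduce task. Such a block is cut into
    sub-blocks; each sub-block and each pair of sub-blocks becomes a
    separate task, so no comparison is lost.
    """
    total = sum(pairs(n) for n in sizes.values())
    threshold = total / num_reducers
    tasks = []
    for key, n in sorted(sizes.items()):
        if pairs(n) <= threshold:
            tasks.append(((key,), pairs(n)))
            continue
        m = min(num_reducers, n)  # number of sub-blocks (assumption)
        sub = [n // m + (1 if i < n % m else 0) for i in range(m)]
        for i in range(m):
            tasks.append(((key, i), pairs(sub[i])))  # within sub-block i
            for j in range(i + 1, m):
                tasks.append(((key, i, j), sub[i] * sub[j]))  # across i and j
    return tasks


def assign(tasks, num_reducers):
    """Greedy assignment: largest task first, to the least loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for task_id, cost in sorted(tasks, key=lambda t: -t[1]):
        load, r = heapq.heappop(heap)
        assignment[task_id] = r
        heapq.heappush(heap, (load + cost, r))
    loads = [0] * num_reducers
    for load, r in heap:
        loads[r] = load
    return assignment, loads


# Skewed example: one block of 10 entities, two blocks of 2 entities.
sizes = Counter({"a": 10, "b": 2, "c": 2})
tasks = plan_match_tasks(sizes, num_reducers=2)
assignment, loads = assign(tasks, num_reducers=2)
```

In this example, assigning whole blocks would leave one reduce task with at least the 45 comparisons of block "a", while the split plan keeps both tasks close to the average of 23.5 comparisons.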
