A Distributed Load Balance Algorithm of MapReduce for Data Quality Detection

Data quality detection on big data is an important problem in the data quality field. MapReduce is a widely used distributed processing model for big data, and load balance is a key factor in its performance. In this paper, we propose a distributed greedy approximation algorithm for the load balance problem in MapReduce-based data quality detection. We address three key challenges: (a) showing that the problem is NP-complete and proving a meaningful approximation ratio for the proposed algorithm, (b) adding only one more round of MapReduce than conventional processing while taking a minimal share of the total running time, and (c) keeping the algorithm simple and practical to implement. Experimental results on real-life and synthetic data demonstrate that the proposed algorithm is effective for load balance.
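To make the idea of greedy load balancing concrete, the sketch below shows the classic longest-processing-time-first rule on a single machine: key groups are sorted by size and each is given to the reducer with the smallest load so far. This is an illustration only, not the paper's distributed algorithm, which the abstract does not spell out. The function name `greedy_assign`, the `key_sizes` input, and the sample sizes are all assumptions introduced here.

```python
import heapq


def greedy_assign(key_sizes, num_reducers):
    """Assign key groups to reducers, largest first, always to the least-loaded reducer.

    key_sizes: dict mapping a key (hypothetical) to its record count.
    Returns (assignment, loads): key -> reducer index, and per-reducer totals.
    """
    # Min-heap of (current load, reducer index); ties go to the lower index.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    loads = [0] * num_reducers
    # Process the largest groups first so small groups can even out the loads.
    for key, size in sorted(key_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        loads[reducer] = load + size
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads


if __name__ == "__main__":
    # Illustrative sizes only; not data from the paper.
    sizes = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
    assignment, loads = greedy_assign(sizes, 2)
    print(assignment, loads)
```

In a MapReduce setting, the group sizes would presumably be gathered in an extra counting round before the assignment is computed, which matches the "one more round" constraint stated above; how the paper distributes the greedy step itself is not described in this section.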
