Handling Data Skew in MapReduce Cluster by Using Partition Tuning

The healthcare industry generates large amounts of data, and analyzing it has emerged as an important problem in recent years. The MapReduce programming model has been used successfully for big data analytics. However, data skew occurs frequently in big data analytics and seriously degrades efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In contrast to the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy: it first disperses key-value pairs across virtual partitions, and then applies the partition tuning method to recombine those partitions when data skew occurs. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm handles data skew in MapReduce efficiently and improves the performance of MapReduce jobs in comparison with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that adopting the PTSH algorithm significantly reduces the time needed for rule extraction, which makes it well suited to association rule mining (ARM) on healthcare data.
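
The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the PTSH implementation: the function names (`virtual_partition`, `virtual_loads`, `recombine`, `reducer_loads`) are hypothetical, the CRC32 hash is an arbitrary choice, and the recombination step uses a simple greedy largest-first heuristic as a stand-in for the paper's partition tuning method, which the abstract does not specify.

```python
import heapq
import zlib


def virtual_partition(key: str, num_virtual: int) -> int:
    """Stage 1: hash a key into one of many virtual partitions.

    CRC32 is an assumption made here for determinism; any stable hash works.
    """
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def virtual_loads(pairs, num_virtual: int):
    """Count how many key-value pairs land in each virtual partition."""
    loads = [0] * num_virtual
    for key, _value in pairs:
        loads[virtual_partition(key, num_virtual)] += 1
    return loads


def recombine(loads, num_reducers: int):
    """Stage 2: map virtual partitions to reducers.

    Greedy largest-first heuristic (an illustrative substitute for partition
    tuning): take virtual partitions in descending order of load and give
    each one to the reducer with the smallest load so far.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    order = sorted(range(len(loads)), key=lambda i: loads[i], reverse=True)
    for vp in order:
        load, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (load + loads[vp], reducer))
    return assignment


def reducer_loads(loads, assignment, num_reducers: int):
    """Total number of pairs each reducer receives under an assignment."""
    totals = [0] * num_reducers
    for vp, reducer in assignment.items():
        totals[reducer] += loads[vp]
    return totals
```

For example, with virtual partition loads `[8, 1, 1, 2, 2, 2]` and two reducers, `recombine` places the heavy partition alone on one reducer and the five light ones on the other, so `reducer_loads` returns `[8, 8]`. Using more virtual partitions than reducers is what gives the second stage room to rebalance; a single hot key still cannot be split by this sketch.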
