MapReduce Scheduler to Minimize the Size of Intermediate Data in Shuffle Phase

Hadoop MapReduce has been one of the most cost-effective ways to process huge volumes of data over the past decade. Although Hadoop is open source, setting it up on-premise is not affordable for small-scale businesses and research entities. Consequently, consuming Hadoop MapReduce as a cloud service is increasingly popular, since it is scalable on demand and billed on a pay-per-use model. In such a multi-tenant environment, virtual bandwidth is an expensive commodity, and co-located virtual machines compete with each other for it. A study shows that 26%-70% of MapReduce job latency is due to the shuffle phase of the MapReduce execution sequence. The primary expectation of a typical cloud user is to minimize the service usage cost. Allocating less bandwidth to the service costs less but increases job latency and, consequently, the makespan. This trade-off can be reconciled at the application level by minimizing the amount of intermediate data transferred in the shuffle phase. To achieve this, we propose a Time Sharing MapReduce Job Scheduler that minimizes the amount of intermediate data and thereby cuts down the service cost. As a by-product, MapReduce job latency and makespan are also improved. Results show that our proposed model reduces the size of intermediate data by up to 62.1% compared to classical schedulers with combiners.
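The abstract does not detail the proposed scheduler, but the baseline it is compared against ("classical schedulers with combiners") reduces shuffle volume by pre-aggregating map output at the application level. The sketch below is a minimal, hypothetical Hadoop word-count job written against the standard org.apache.hadoop.mapreduce API; it is not the proposed Time Sharing scheduler, and the class names and paths are illustrative. It shows how a combiner is attached so that duplicate (word, 1) records are collapsed on each map node before they reach the shuffle.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Minimal word-count job illustrating the combiner baseline: per-map-task
 * pre-aggregation so that fewer intermediate records enter the shuffle phase.
 * Hypothetical example, not the proposed Time Sharing scheduler.
 */
public class CombinerWordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token; without a combiner, every one of
      // these records would be shuffled across the network to a reducer.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts for a key; used both as the combiner (on map-local
      // output) and as the final reducer.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combiner word count");
    job.setJarByClass(CombinerWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner runs on each map task's local output and collapses
    // duplicate keys before they are written to the shuffle, shrinking
    // the intermediate data transferred over the virtual network.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would be launched as, for example, "hadoop jar wordcount.jar CombinerWordCount <input dir> <output dir>". Reusing the reducer as the combiner is valid here only because integer summation is commutative and associative, which is the standard precondition for applying a combiner.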
