Faster MapReduce Computation on Clouds Through Better Performance Estimation

Processing Big Data in the cloud is on the rise. An important issue for efficiently executing Big Data processing jobs on a cloud platform is selecting the best-fitting virtual machine (VM) configuration(s) among the many choices that cloud providers offer. Wise selection of VM configurations can improve performance, cost, and energy consumption. It is therefore crucial to explore the available configurations and choose those that best suit each MapReduce application. Profiling a given application on all configurations is costly and consumes both time and energy. An alternative is to run the application on a subset of configurations (sample configurations) and estimate its performance on the remaining configurations from the values measured on the samples. We show that the choice of these sample configurations strongly affects the accuracy of the later estimations. Our Smart Configuration Selection (SCS) scheme chooses better representatives from among all configurations through a one-off analysis of the given benchmark performance figures, so as to increase the accuracy of the estimated missing values and, consequently, to choose more accurately the configuration that provides the highest performance. The results show that the SCS choice of sample configurations is very close to the best choice and can reduce the estimation error from 19.72 percent under random configuration selection to 11.58 percent. More importantly, using SCS estimations in a makespan minimization algorithm improves execution time by up to 36.03 percent compared with random sample selection.
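To make the sampling idea concrete, the following is a minimal sketch of why the choice of sample configurations matters. It is not the SCS scheme itself. It assumes a hypothetical estimator that scales the best-matching benchmark to fit the measured samples, and all names (`estimate_missing`, `mean_abs_pct_error`) and runtime figures are illustrative assumptions rather than data from the paper.

```python
from typing import Dict, List, Sequence


def estimate_missing(benchmarks: List[List[float]],
                     measured: Dict[int, float]) -> List[float]:
    """Estimate an application's runtime on every configuration.

    benchmarks[b][c] is the known runtime of benchmark b on configuration c.
    measured maps each sampled configuration index to the measured runtime.
    Hypothetical estimator (an assumption, not the paper's method): pick the
    benchmark whose scaled profile best fits the samples.
    """
    best_fit = None
    for row in benchmarks:
        # Least-squares scale factor through the origin on the sampled columns.
        num = sum(row[c] * t for c, t in measured.items())
        den = sum(row[c] ** 2 for c in measured)
        scale = num / den
        residual = sum((scale * row[c] - t) ** 2 for c, t in measured.items())
        if best_fit is None or residual < best_fit[0]:
            best_fit = (residual, scale, row)
    _, scale, row = best_fit
    # Keep measured values; fill the rest from the scaled benchmark profile.
    return [measured.get(c, scale * row[c]) for c in range(len(row))]


def mean_abs_pct_error(estimate: Sequence[float],
                       truth: Sequence[float]) -> float:
    """Mean absolute percentage error over all configurations."""
    return 100.0 * sum(abs(e - t) / t for e, t in zip(estimate, truth)) / len(truth)


# Illustrative runtimes: two benchmarks on four VM configurations.
# The benchmarks behave identically on configurations 0 and 1.
benchmarks = [[10, 20, 40, 80],
              [10, 20, 22, 24]]
truth = [20, 40, 44, 48]  # the application's true (normally unknown) runtimes

poor = estimate_missing(benchmarks, {0: 20, 1: 40})  # uninformative samples
good = estimate_missing(benchmarks, {2: 44, 3: 48})  # discriminating samples

print(mean_abs_pct_error(poor, truth))
print(mean_abs_pct_error(good, truth))
```

In this toy setting both sample sets cost two profiling runs, yet only the second one distinguishes the two benchmark profiles, so its estimates are far more accurate. Selecting such discriminating samples ahead of time is the kind of decision that SCS is described as automating.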
