Design and performance evaluation for Hadoop clusters on virtualized environment

Hadoop, an implementation of Google's MapReduce, is widely used for big data analysis. Yahoo Inc. operated 25 PB on 25,000 nodes in 2010. Resource management for such a large number of nodes is quite difficult in terms of configuration, deployment, and efficient resource utilization. Deploying Hadoop on virtual machines (VMs) makes its management much easier. Amazon already offers Hadoop on a Xen-virtualized environment as Elastic MapReduce. However, Hadoop on VM clusters suffers performance degradation due to virtualization overhead, so it is important to minimize that overhead. We build a Hadoop performance model and examine how performance is affected by the VM configuration, the allocation of VMs over physical machines, and the multiplicity of jobs. We find that the performance of I/O-intensive jobs is more sensitive to virtualization overhead than that of CPU-intensive jobs. For I/O-intensive jobs, the performance degradation caused by a VM configuration change is at most 55%, and that caused by an allocation change is at most 18%. For these jobs, the best practice is to increase the number of VMs rather than the number of VCPUs per VM, to spread VMs widely over physical servers, and to reduce the number of simultaneously executed jobs. The main source of virtualization overhead is disk I/O shared by multiple VMs on a physical server.
