Enhancing network I/O performance for a virtualized Hadoop cluster

The MapReduce programming model was proposed for processing big data and is implemented by Hadoop, one of the major cloud computing frameworks. With the increasing adoption of cloud computing, running Hadoop on a virtualized cluster is a compelling approach to reducing costs and increasing efficiency. In this paper, we measure the performance of a virtualized network and analyze how network performance affects Hadoop workloads running on a virtualized cluster. We then propose a virtualized network I/O architecture as a novel optimization for virtualized Hadoop clusters operated by public or private cloud providers. The proposed architecture combines traditional network configurations and achieves better performance for Hadoop workloads. We also show a better way to utilize the rack awareness feature of the Hadoop framework in the proposed computing environment. The evaluation demonstrates that the proposed network architecture and mechanisms improve performance by up to 4.1 times compared with a bridge network architecture. The proposed architecture can even nearly match the performance of the expensive, hardware-based single root I/O virtualization (SR-IOV) network architecture.
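As a rough illustration of how rack awareness can be adapted to a virtualized cluster, the sketch below shows a topology script of the kind Hadoop invokes through its `net.topology.script.file.name` property: Hadoop passes node addresses as arguments and reads one rack path per argument from standard output. The abstract does not describe the paper's actual mechanism, so the mapping used here (treating each physical host as a "rack" so that co-located VMs are recognized as topologically close) is an assumption, and the addresses, host names, and the `VM_TO_HOST` table are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop topology script for a virtualized cluster.

Assumption (not taken from the paper): every physical host is reported
as its own rack, so VMs sharing a host appear rack-local to Hadoop.
"""
import sys

# Hypothetical VM-to-physical-host table; all values are illustrative.
VM_TO_HOST = {
    "10.0.1.11": "host1",
    "10.0.1.12": "host1",
    "10.0.2.11": "host2",
    "10.0.2.12": "host2",
}

# Rack path returned for nodes that are missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_rack(node):
    """Return the rack path for one VM address or name."""
    host = VM_TO_HOST.get(node)
    return "/" + host if host else DEFAULT_RACK


def main(argv):
    # Emit one rack path per input argument, in the same order.
    print(" ".join(resolve_rack(node) for node in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With a mapping like this, two VMs on the same physical machine resolve to the same rack path, which lets the scheduler and HDFS replica placement distinguish traffic that stays inside a host from traffic that crosses the physical network.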
