Location-Aware MapReduce in Virtual Cloud

MapReduce is an important programming model for processing and generating large data sets in parallel. It is commonly applied in areas such as web indexing, data mining, and machine learning. Hadoop, an open-source implementation of MapReduce, is now widely used in industry. Virtualization, which is easy to configure and economical to use, shows great potential for cloud computing. As the number of cores per CPU grows and virtualization techniques are adopted, one physical machine can host more and more virtual machines, but I/O devices do not normally scale as rapidly. Because MapReduce systems are often used to run I/O-intensive applications, reduced data redundancy and load imbalance, both of which increase I/O interference in a virtual cloud, become serious problems. This paper builds a model and defines metrics to analyze the data allocation problem in virtual environments theoretically. We also design a location-aware file block allocation strategy that retains compatibility with native Hadoop. Model simulation and experiments on a real system show that the new strategy achieves better data redundancy and load balance, reducing I/O interference. The execution time of applications such as RandomWriter, Text Sort, and Word Count is reduced by up to 33%, and by 10% on average.
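The abstract does not spell out the allocation algorithm, so the following is only a minimal sketch of what a location-aware placement rule could look like, not the authors' method. It assumes the mapping from virtual machines to physical hosts is known and that a simple per-VM block count stands in for load; `place_replicas`, `vm_to_host`, and `vm_load` are hypothetical names introduced for illustration. The idea is to put the replicas of one block on VMs that sit on different physical hosts, preferring lightly loaded hosts, so that replicas do not share one machine's I/O devices.

```python
from collections import defaultdict


def place_replicas(vm_to_host, vm_load, replicas=3):
    """Choose VMs for one block's replicas (illustrative sketch only).

    vm_to_host: dict mapping VM name -> physical host name (assumed known).
    vm_load:    dict mapping VM name -> number of blocks already stored.
    Returns a list of VM names, at most `replicas` long.
    """
    # Aggregate load per physical host, since co-located VMs share I/O devices.
    host_load = defaultdict(int)
    for vm, host in vm_to_host.items():
        host_load[host] += vm_load.get(vm, 0)

    # Order candidates by host load, then VM load, then name for determinism.
    candidates = sorted(
        vm_to_host,
        key=lambda vm: (host_load[vm_to_host[vm]], vm_load.get(vm, 0), vm),
    )

    chosen = []
    used_hosts = set()

    # First pass: at most one replica per physical host.
    for vm in candidates:
        if len(chosen) == replicas:
            break
        host = vm_to_host[vm]
        if host not in used_hosts:
            chosen.append(vm)
            used_hosts.add(host)

    # Second pass: if there are fewer hosts than replicas, reuse hosts.
    for vm in candidates:
        if len(chosen) == replicas:
            break
        if vm not in chosen:
            chosen.append(vm)

    return chosen


if __name__ == "__main__":
    vm_to_host = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB",
                  "vm4": "hostB", "vm5": "hostC"}
    vm_load = {"vm1": 2, "vm2": 0, "vm3": 1, "vm4": 1, "vm5": 5}
    print(place_replicas(vm_to_host, vm_load))
```

A location-unaware policy could place two replicas on `vm1` and `vm2`, which share `hostA`; the sketch avoids that whenever enough distinct hosts exist and falls back to sharing a host only when it must.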
