vLocality: Revisiting Data Locality for MapReduce in Virtualized Clouds

Recent years have witnessed a surge of new generation applications involving big data. The de facto framework for big data processing, MapReduce, has been increasingly embraced by both academic and industrial users. Data locality seeks to co-locate computation with data, which effectively reduces remote data access and improves MapReduce’s performance in physical machine clusters. State-of-the-art public clouds heavily rely on virtualization to enable resource sharing and scaling for massive users, however. In this article, through real-world experiments, we show strong evidence that the conventional notion of data locality is unfortunately not always beneficial for MapReduce in a virtualized environment. The observations suggest that the measure of node-local must be extended to distinguish physical and virtual entities. We develop vLocality, a comprehensive and practical solution for data locality in virtualized environments. It incorporates a novel storage architecture that efficiently mitigates the shared disk contention, and an enhanced task scheduling algorithm that prioritizes co-located VMs. We have implemented a prototype of vLocality based on Hadoop 1.2.1, and have validated its effectiveness on a typical virtualized cloud platform consisting of 22 nodes. Our experimental results demonstrate that vLocality can improve the job finish time to around a quarter of that for typical Hadoop benchmark applications.

[1]  Cheng-Zhong Xu,et al.  Interference and locality-aware task scheduling for MapReduce applications in virtual clusters , 2013, HPDC.

[2]  Athanasios V. Vasilakos,et al.  Managing Performance Overhead of Virtual Machines in Cloud Computing: A Survey, State of the Art, and Future Directions , 2014, Proceedings of the IEEE.

[3]  Haiyang Wang,et al.  On interference-aware provisioning for cloud-based big data processing , 2013, 2013 IEEE/ACM 21st International Symposium on Quality of Service (IWQoS).

[4]  Guangwen Yang,et al.  Location-Aware MapReduce in Virtual Cloud , 2011, 2011 International Conference on Parallel Processing.

[5]  Mohammad Hammoud,et al.  Locality-Aware Reduce Task Scheduling for MapReduce , 2011, 2011 IEEE Third International Conference on Cloud Computing Technology and Science.

[6]  Geoffrey C. Fox,et al.  Investigation of Data Locality in MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[7]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[8]  Bo Li,et al.  iAware: Making Live Migration of Virtual Machines Interference-Aware in the Cloud , 2014, IEEE Transactions on Computers.

[9]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[10]  Hai Jin,et al.  Building a network highway for big data: architecture and challenges , 2014, IEEE Network.

[11]  Seung Ryoul Maeng,et al.  Locality-aware dynamic VM reconfiguration on MapReduce clouds , 2012, HPDC '12.

[12]  Shengzhong Feng,et al.  Improving Data Locality of MapReduce by Scheduling in Homogeneous Computing Environments , 2011, 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications.

[13]  Cristina L. Abad,et al.  DARE: Adaptive Data Replication for Efficient Cluster Scheduling , 2011, 2011 IEEE International Conference on Cluster Computing.

[14]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[15]  Hai Jin,et al.  Evaluating MapReduce on Virtual Machines: The Hadoop Case , 2009, CloudCom.