CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

Running big data analytics frameworks in the cloud is becoming increasingly important, but their resource managers, in their current form, are not designed for virtualized environments. In this work, we investigate the levels of data locality available in a virtualized environment, ranging from rack locality to memory locality. By exploiting these finer-grained levels of data locality, our memory locality-aware scheduling algorithm increases the cache hit ratio and thereby reduces network traffic and disk I/O. However, a high cache hit ratio does not necessarily imply a shorter job execution time for MapReduce applications. To resolve this issue, we develop the Cache-Affinity and Virtualization-Aware (CAVA) resource manager, which measures the cache affinity of MapReduce applications at runtime and manages distributed in-memory caches of limited size by assigning high priority to applications with high cache affinity. The proposed memory locality-aware scheduling algorithm is also integrated into the CAVA resource manager. Our extensive experimental study shows that CAVA performs well overall across various workloads composed of multiple big data analytics applications, both by considering the fine-grained data locality levels in virtualized clusters and by using scarce memory resources efficiently.
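
As a rough illustration of scheduling by locality level, the sketch below picks, among free nodes, the one closest to a copy of a task's input block. It is not the CAVA algorithm: the abstract names only rack locality and memory locality as the two ends of the range, so the intermediate levels (DISK, HOST), the REMOTE fallback, and all names (Locality, Node, locality, pick_node) are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    """Hypothetical locality levels, best (lowest value) first."""
    MEMORY = 0  # block is in the in-memory cache of the candidate node
    DISK = 1    # block is on the candidate node's local disk
    HOST = 2    # block is held by another VM on the same physical host
    RACK = 3    # block is held by a node in the same rack
    REMOTE = 4  # block is held only outside the candidate's rack


@dataclass(frozen=True)
class Node:
    name: str
    host: str  # physical host the VM runs on (assumed to be known)
    rack: str


def locality(node, cached_on, stored_on, nodes):
    """Classify how close `node` is to the nearest copy of a block."""
    if node.name in cached_on:
        return Locality.MEMORY
    if node.name in stored_on:
        return Locality.DISK
    holders = [nodes[n] for n in cached_on | stored_on]
    if any(h.host == node.host for h in holders):
        return Locality.HOST
    if any(h.rack == node.rack for h in holders):
        return Locality.RACK
    return Locality.REMOTE


def pick_node(free, cached_on, stored_on, nodes):
    """Return the free node with the best locality; ties break by name."""
    return min(free, key=lambda n: (locality(nodes[n], cached_on, stored_on, nodes), n))


nodes = {
    "vm1": Node("vm1", "h1", "r1"),
    "vm2": Node("vm2", "h1", "r1"),
    "vm3": Node("vm3", "h2", "r1"),
    "vm4": Node("vm4", "h3", "r2"),
}
# The block is cached in memory on vm1 and stored on disk on vm3.
choice = pick_node({"vm2", "vm3", "vm4"}, {"vm1"}, {"vm3"}, nodes)
```

Here vm3 is chosen because it holds the block on local disk; if vm3 were busy, vm2 would be preferred over vm4 because it shares a physical host with the cached copy on vm1.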
