CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks

To process data from IoT and wearable devices, analysis tasks are often offloaded to the cloud. As the amount of sensing data continues to grow, optimizing data analytics frameworks is critical to processing the sensed data efficiently. A key approach to speeding up data analytics frameworks in the cloud is caching intermediate data that is reused repeatedly in iterative computations. Existing analytics engines implement caching in various ways: some use run-time mechanisms with dynamic profiling, while others rely on programmers to decide which data to cache. Even though caching has been studied extensively in computer systems research, recent data analytics frameworks still leave room for optimization. Because sophisticated caching must consider complex execution contexts such as cache capacity, the size of the data to cache, and the victims to evict, no general solution exists for data analytics frameworks. In this paper, we propose an application-specific cost-capacity-aware caching scheme for in-memory data analytics frameworks. We use a cost model, built from multiple representative inputs, and an execution flow analysis, extracted from the DAG schedule, to select primary caching candidates among the intermediate data. Once the caching candidates are determined, the optimal caching is selected automatically during execution, so programmers no longer have to decide manually which intermediate data to cache. We implemented our scheme in Apache Spark and evaluated it experimentally on the HiBench benchmarks. Compared to the caching decisions in the original benchmarks, our scheme improves performance by 27% when cache memory is sufficient and by 11% when it is insufficient.
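To illustrate the kind of decision the scheme automates, the following is a minimal, hypothetical Scala sketch (not the paper's actual algorithm): it assumes per-dataset statistics (estimated recompute cost, size, and reuse count derived from the DAG) are already available, and greedily selects candidates to cache under a capacity budget by benefit per megabyte. The names Candidate, CacheSelector, and the benefit formula are illustrative assumptions only.

// Hypothetical sketch of cost-capacity-aware cache candidate selection.
// Assumes recompute cost, size, and reuse count were profiled beforehand
// from representative inputs and the DAG schedule.
case class Candidate(name: String, recomputeCostSec: Double, sizeMB: Double, reuseCount: Int)

object CacheSelector {
  // Benefit of caching = recomputation cost saved on every reuse after the first materialization.
  private def benefit(c: Candidate): Double = c.recomputeCostSec * (c.reuseCount - 1)

  // Rank by benefit per MB and pick candidates until the capacity budget is spent.
  def select(candidates: Seq[Candidate], capacityMB: Double): Seq[Candidate] = {
    val ranked = candidates.filter(_.reuseCount > 1).sortBy(c => -benefit(c) / c.sizeMB)
    ranked.foldLeft((Seq.empty[Candidate], capacityMB)) {
      case ((chosen, remaining), c) if c.sizeMB <= remaining => (chosen :+ c, remaining - c.sizeMB)
      case (acc, _) => acc
    }._1
  }

  def main(args: Array[String]): Unit = {
    // Illustrative statistics for three intermediate datasets.
    val stats = Seq(
      Candidate("parsedInput", recomputeCostSec = 40.0, sizeMB = 800.0,  reuseCount = 5),
      Candidate("features",    recomputeCostSec = 25.0, sizeMB = 300.0,  reuseCount = 10),
      Candidate("joined",      recomputeCostSec = 60.0, sizeMB = 1200.0, reuseCount = 2)
    )
    select(stats, capacityMB = 1000.0).foreach(c => println(s"cache ${c.name}"))
  }
}

Under these assumptions, the selected datasets would then be pinned with Spark's standard persistence API (e.g. cache() or persist()) instead of relying on manual programmer annotations.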
