Improving Apache Spark's Cache Mechanism with LRC-Based Method Using Bloom Filter

Memory-and-Disk caching is a common mechanism for caching temporary output in Apache Spark. However, because Spark's cache management is based on LRU (Least Recently Used), performance degrades once memory usage reaches its limit. Existing studies have proposed replacing the LRU-based cache mechanism with one based on LRC (Least Reference Count), which is a more accurate indicator of the likelihood of future data access. Even so, frequently used partitions cannot be identified, because Spark accesses all partitions for user-driven RDD operations, including partitions that do not contain the necessary data. In this paper, we propose a cache management method that allocates the necessary partitions to memory by introducing a Bloom filter into existing methods. The Bloom filter checks whether each partition may contain the required data, so partitions that do not contain it are not processed. As a result, the reference count of each partition reflects actual use, and frequently used partitions can be determined properly. To find the best placement for the Bloom filter, we implemented two architectures: a driver-side Bloom filter and an executor-side Bloom filter. Evaluation results showed that the driver-side implementation reduced execution time by 89% in a filter-test benchmark compared with the LRC-based method.
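The idea can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation and does not use any Spark API: `BloomFilter`, `PartitionCache`, and `lookup` are hypothetical names, partitions are plain Python lists, and the reference count is simplified to the number of accesses seen so far. The sketch assumes one Bloom filter per partition that is consulted before the partition is scanned, so only partitions that may contain the key are processed and counted.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class PartitionCache:
    """Hypothetical memory tier that evicts the least-referenced partition."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_memory = set()
        self.ref_count = {}

    def touch(self, pid):
        # Count the reference, admit the partition, then evict the
        # resident partition with the lowest count if over capacity.
        self.ref_count[pid] = self.ref_count.get(pid, 0) + 1
        self.in_memory.add(pid)
        if len(self.in_memory) > self.capacity:
            victim = min(self.in_memory, key=lambda p: self.ref_count[p])
            self.in_memory.remove(victim)


def build_filters(partitions):
    """Build one Bloom filter per partition (assumed to happen at cache time)."""
    filters = {}
    for pid, rows in partitions.items():
        bf = BloomFilter()
        for row in rows:
            bf.add(row)
        filters[pid] = bf
    return filters


def lookup(partitions, filters, cache, key):
    """Scan only partitions whose Bloom filter says the key may be present."""
    hits = []
    for pid, rows in partitions.items():
        if not filters[pid].might_contain(key):
            continue  # skipped: not scanned, and no reference is counted
        cache.touch(pid)
        hits.extend(row for row in rows if row == key)
    return hits


if __name__ == "__main__":
    partitions = {0: ["a", "b"], 1: ["c", "d"], 2: ["e", "f"]}
    filters = build_filters(partitions)
    cache = PartitionCache(capacity=1)
    for _ in range(3):
        lookup(partitions, filters, cache, "a")
    lookup(partitions, filters, cache, "c")
    print(cache.ref_count, cache.in_memory)
```

Because a skipped partition never reaches `touch`, its count stays low and it becomes the preferred eviction victim. A Bloom filter can report false positives, so a partition is occasionally scanned without containing the key, but it never skips a partition that does contain it.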
