Detecting cache-related bugs in Spark applications

Apache Spark is widely used to build big data applications. Spark uses the Resilient Distributed Dataset (RDD) abstraction to store and retrieve large-scale data. To avoid recomputing an RDD, Spark can cache it in memory and reuse it later, which improves performance. Spark relies on application developers to make caching decisions through the persist() and unpersist() APIs, i.e., which RDDs are persisted and when each RDD is persisted and unpersisted. Incorrect caching decisions can cause duplicate computation or waste scarce memory, and can thus seriously degrade the performance of Spark applications. In this paper, we propose CacheCheck, which automatically detects cache-related bugs in Spark applications. We summarize six cache-related bug patterns in Spark applications, and then dynamically detect cache-related bugs by analyzing the execution traces of Spark applications. We evaluate CacheCheck on six real-world Spark applications. The experimental results show that CacheCheck detects 72 previously unknown cache-related bugs, 28 of which have been fixed by developers.
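To make the idea of trace-based detection concrete, the following is a minimal sketch in plain Python, not CacheCheck itself. It assumes a hypothetical, simplified trace format: a list of (event, rdd) pairs where the event is "persist", "unpersist", or "use" (an action computes or reads the RDD). The three checks and their labels are illustrative assumptions and are not claimed to match the six bug patterns summarized in the paper.

```python
from collections import Counter


def find_cache_issues(trace):
    """Scan a simplified execution trace for suspicious caching decisions.

    `trace` is a list of (event, rdd) pairs; the event names and the three
    checks below are hypothetical, chosen only for illustration.
    """
    uses = Counter()        # how many actions used each RDD
    persisted = set()       # RDDs currently marked as cached
    ever_persisted = set()  # RDDs that were persisted at any point

    for event, rdd in trace:
        if event == "persist":
            persisted.add(rdd)
            ever_persisted.add(rdd)
        elif event == "unpersist":
            persisted.discard(rdd)
        elif event == "use":
            uses[rdd] += 1
        else:
            raise ValueError(f"unknown event: {event!r}")

    issues = []
    for rdd in sorted(set(uses) | ever_persisted):
        # Used by several actions but never cached: it is recomputed each time.
        if uses[rdd] >= 2 and rdd not in ever_persisted:
            issues.append((rdd, "missing persist"))
        # Cached but used at most once: the cached copy is never reused.
        if rdd in ever_persisted and uses[rdd] <= 1:
            issues.append((rdd, "unnecessary persist"))
        # Still cached when the trace ends: the memory is never released.
        if rdd in persisted:
            issues.append((rdd, "missing unpersist"))
    return issues


example_trace = [
    ("use", "a"), ("use", "a"),                          # reused, never cached
    ("persist", "b"), ("use", "b"), ("unpersist", "b"),  # cached, used once
    ("persist", "c"), ("use", "c"), ("use", "c"),        # cached, never released
]
example_issues = find_cache_issues(example_trace)
```

A real detector would work on Spark's actual execution traces and would also need to reason about when each persist() and unpersist() call happens relative to the actions that use the RDD, which this sketch ignores.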
