LCRC: A Dependency-Aware Cache Management Policy for Spark

Memory is a constrained resource in in-memory big data computing systems, and efficient memory management plays a pivotal role in their performance. However, simple history-based cache replacement strategies, such as Least Recently Used (LRU), usually perform poorly in cluster applications because they lack knowledge of data dependencies. Least Reference Count (LRC) is dependency-aware: it gives blocks with higher reference counts higher priority to reside in memory. Yet some of these blocks go unaccessed for several stages of their life cycle, so keeping them cached wastes available memory. To eliminate this shortcoming, we propose LCRC, a dependency-aware cache management policy that considers both intra-stage and inter-stage dependencies. Through a prefetching mechanism, blocks accessed across stages are rewritten into memory before their next use. Experiments show that, compared with previous methods, the proposed mechanism improves computing performance by over 65%.
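To make the policy concrete, the following is a minimal sketch (not the authors' implementation) of the two ideas the abstract describes: evicting the cached block with the fewest remaining references, and prefetching blocks that an upcoming stage will reference again. The class name, reference-count bookkeeping, and `prefetch` hook are illustrative assumptions.

```python
class LCRCCache:
    """Illustrative sketch of an LCRC-style policy (hypothetical API):
    evict the block with the lowest remaining reference count, and
    prefetch blocks that a later stage will reference again."""

    def __init__(self, capacity, ref_counts):
        self.capacity = capacity      # max number of cached blocks
        self.refs = dict(ref_counts)  # block -> remaining reference count
        self.cache = {}               # block -> data

    def access(self, block, load_fn):
        # Each access consumes one reference.
        if block in self.cache:
            data = self.cache[block]
        else:
            data = load_fn(block)
            self._insert(block, data)
        self.refs[block] = max(0, self.refs.get(block, 0) - 1)
        # A block with zero remaining references can be dropped eagerly.
        if self.refs[block] == 0:
            self.cache.pop(block, None)
        return data

    def _insert(self, block, data):
        while len(self.cache) >= self.capacity:
            # Evict the cached block with the fewest remaining references.
            victim = min(self.cache, key=lambda b: self.refs.get(b, 0))
            del self.cache[victim]
        self.cache[block] = data

    def prefetch(self, upcoming_blocks, load_fn):
        # Rewrite into memory the blocks the next stage will reference
        # (the inter-stage prefetching idea).
        for b in upcoming_blocks:
            if self.refs.get(b, 0) > 0 and b not in self.cache:
                self._insert(b, load_fn(b))
```

Under this sketch, a block that LRC would keep resident through stages that never touch it can instead be evicted and brought back by `prefetch` just before the stage that does.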
