Selection and replacement algorithms for memory performance improvement in Spark

As a parallel computation framework, Spark can cache the partitions of repeatedly used resilient distributed datasets (RDDs) on different nodes to speed up computation. However, Spark has no good mechanism for selecting which RDDs should have their partitions cached in limited memory. In this paper, we propose a novel selection algorithm by which Spark automatically selects the RDDs whose partitions are cached in memory, according to the number of times each RDD is used. Our selection algorithm speeds up iterative computations. Nevertheless, when many new RDDs are chosen for caching while the limited memory is already full, the system falls back on the least recently used (LRU) replacement algorithm. LRU considers only whether an RDD's partitions were recently used and ignores other factors, such as computation cost. We therefore also put forward a novel replacement algorithm, called the weight replacement (WR) algorithm, which jointly considers a partition's computation cost, the number of times it is used, and its size. Experimental results show that Spark computes faster with our selection algorithm than without it, and that Spark with the WR algorithm shows better performance. Copyright © 2015 John Wiley & Sons, Ltd.
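
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract names the factors (use count, computation cost, size) but gives no formula, so the threshold `min_uses`, the weight `compute_cost * uses / size`, and all names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    compute_cost: float  # cost to recompute, e.g. seconds (hypothetical unit)
    uses: int            # number of times the partition is used
    size: float          # memory footprint, e.g. MB (hypothetical unit)


def select_rdds_to_cache(use_counts, min_uses=2):
    """Return the RDDs used at least `min_uses` times (assumed threshold)."""
    return [name for name, n in use_counts.items() if n >= min_uses]


def weight(p):
    # Assumed formula: costly, frequently used, small partitions score high
    # and are therefore kept in memory longer.
    return p.compute_cost * p.uses / p.size


def choose_victims(cached, needed):
    """Pick lowest-weight partitions until at least `needed` space is freed."""
    victims, freed = [], 0.0
    for p in sorted(cached, key=weight):
        if freed >= needed:
            break
        victims.append(p)
        freed += p.size
    return victims


# Selection: only RDDs that are reused are worth caching.
selected = select_rdds_to_cache({"lines": 1, "parsed": 3, "ranks": 5})

# Replacement: evict by weight rather than by recency.
cached = [
    Partition("a", compute_cost=10.0, uses=5, size=100.0),
    Partition("b", compute_cost=2.0, uses=1, size=200.0),
    Partition("c", compute_cost=8.0, uses=2, size=50.0),
]
victims = [p.name for p in choose_victims(cached, needed=220.0)]
```

In this sketch, a cheap, rarely used, large partition is evicted first, whereas LRU would decide on recency of use alone.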
