Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data Processing Frameworks

The shift to the in-memory data processing paradigm has had a major influence on the development of cluster data processing frameworks. Numerous frameworks from industry, the open source community, and academia have adopted the in-memory paradigm to achieve breakthroughs in functionality and performance. However, despite the advantages of these in-memory frameworks, in practice they are susceptible to memory-pressure-related performance collapse and failures. The contributions of this paper are twofold. First, we conduct a detailed diagnosis of the memory pressure problem and identify three preconditions for the performance collapse. These preconditions not only explain the problem but also shed light on possible solution strategies. Second, we propose a novel programming abstraction called the leaky buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We have implemented a leaky buffer enabled hashtable in Spark, and we believe it can also substitute for hashtables that perform similar hash aggregation operations in other programs or data processing frameworks. Experiments on a range of memory-intensive aggregation operations show that the leaky buffer abstraction can drastically reduce the occurrence of memory-related failures, improve performance by up to 507 percent, and reduce memory usage by up to 87.5 percent.
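To make the idea concrete: a leaky-buffer-style aggregation table bounds its in-memory footprint by "leaking" (spilling) partial aggregates to secondary storage when a memory budget is exceeded, then merging the spills at the end. The following is a minimal illustrative sketch in Python, not the paper's Spark implementation; the class name, the entry-count budget, and the pickle-to-tempfile spill mechanism are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

class LeakyHashAggregator:
    """Hypothetical sketch of a leaky-buffer hash aggregation table.

    Sums values per key; when the in-memory table grows past a budget,
    the whole table is spilled ("leaked") to disk and cleared, keeping
    peak memory bounded. Spilled partial sums are merged at the end.
    """

    def __init__(self, max_entries=1024):
        self.table = {}
        self.max_entries = max_entries  # illustrative budget (entry count)
        self.spills = []                # paths of spilled partial aggregates

    def add(self, key, value):
        self.table[key] = self.table.get(key, 0) + value
        if len(self.table) > self.max_entries:
            self._leak()

    def _leak(self):
        # Spill current partial aggregates to a temp file, freeing memory.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.table, f)
        self.spills.append(path)
        self.table = {}

    def result(self):
        # Merge the in-memory table with all spilled partial aggregates.
        merged = dict(self.table)
        for path in self.spills:
            with open(path, "rb") as f:
                for k, v in pickle.load(f).items():
                    merged[k] = merged.get(k, 0) + v
            os.remove(path)
        self.spills = []
        return merged
```

With `max_entries=2`, feeding the pairs `("a", 1)`, `("b", 2)`, `("a", 3)`, `("c", 4)` triggers one spill and still yields the correct totals `{"a": 4, "b": 2, "c": 4}`. The real abstraction as described in the paper operates inside Spark's aggregation path and is governed by actual memory pressure rather than a simple entry count.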
