Architecture of a distributed storage that combines file system, memory and computation in a single layer

Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed by separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, together with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
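As a rough illustration of managing long-lived and intermediate data in one layer, the following minimal sketch keeps both kinds of pages in a single buffer pool with one replacement policy. It is not Pangea's actual design or API: the class name UnifiedBufferPool, the durable flag, the plain LRU policy, and the dictionary standing in for disk are all assumptions made for this example, and dropping evicted intermediate pages (rather than spilling them) is a simplification.

```python
from collections import OrderedDict


class UnifiedBufferPool:
    """Toy single-layer pool for both long-lived and intermediate pages.

    Hypothetical illustration only; not the interface of any real system.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity    # maximum number of in-memory pages
        self.pages = OrderedDict()  # page_id -> (data, durable), oldest first
        self.disk = {}              # stand-in for persistent storage

    def put(self, page_id, data, durable):
        """Insert or overwrite a page, evicting LRU pages if the pool is full."""
        self.pages.pop(page_id, None)
        while len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = (data, durable)

    def get(self, page_id):
        """Return page data, reloading durable pages from disk on a miss."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id][0]
        if page_id in self.disk:
            data = self.disk[page_id]
            self.put(page_id, data, durable=True)
            return data
        return None  # never stored, or an intermediate page that was dropped

    def _evict(self):
        """Evict the least recently used page with one policy for all data."""
        victim_id, (data, durable) = self.pages.popitem(last=False)
        if durable:
            self.disk[victim_id] = data  # long-lived data is written back
        # Assumption: intermediate data is recomputable, so it is dropped.
```

Because one component sees every page, the eviction decision can take the page's lifetime into account directly, which separate file system, cache, and execution layers cannot easily do.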
