Pangea: Monolithic Distributed Storage for Data Analytics

Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs, both from redundantly copying data across layers and from having to decide on a proper resource allocation for every layer. In this paper we propose Pangea, a single system that manages all data, both intermediate and long-lived, together with its buffering and caching, data placement optimization, and failure recovery, in one monolithic storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
