Paper information - Architecture of a distributed storage that combines file system, memory and computation in a single layer

Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, together with its buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.
