Assise: Performance and Availability via NVM Colocation in a Distributed File System

Disaggregated, or non-local, file storage has become a common design pattern in cloud systems, offering the benefits of resource pooling and server specialization, where the inherent overhead of separating compute and storage is mostly hidden by storage device latency. We take an alternative approach, motivated by the commercial availability of very low latency non-volatile memory (NVM). By colocating computation and NVM storage, we can provide applications with much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated cache coherence protocol that manages a set of colocated NVM storage devices as a layered, distributed cache. Unlike disaggregated file stores, Assise avoids the read and write amplification of page-granularity operations. Instead, remote NVM serves as an intermediate, byte-addressable cache between colocated NVM and slower storage, such as SSDs. We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC persistent memory modules and SSDs, running common cloud applications and benchmarks such as LevelDB, Postfix, MinuteSort, and FileBench. We find that Assise improves write latency by up to 22x, throughput by up to 56x, and fail-over time by up to 103x, and scales up to 6x better than Ceph, while providing stronger consistency semantics.
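To make the layered-cache idea concrete, below is a minimal sketch (in C) of how a read might resolve across storage tiers: colocated NVM first, then a replica's remote NVM, then slower shared storage. All type, function, and parameter names (cached_read, local_nvm_lookup, remote_nvm_fetch, cold_storage_read) are illustrative assumptions, not Assise's actual API; the stubs stand in for real cache-index lookups and RDMA transfers.

/*
 * Hypothetical sketch, not the Assise implementation: a read path that
 * checks the colocated NVM cache first, then a replica's remote NVM,
 * and finally falls back to slower shared storage (e.g., an SSD pool).
 * Function and type names are illustrative assumptions.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum tier { LOCAL_NVM, REMOTE_NVM, COLD_STORAGE };

/* Stub lookup; a real system would index a persistent log or cache in NVM. */
static bool local_nvm_lookup(uint64_t inode, uint64_t offset, char *buf, size_t len) {
    (void)inode; (void)offset; (void)buf; (void)len;
    return false;                       /* miss: data not in colocated NVM */
}

/* Stub fetch; a one-sided RDMA read from a replica's NVM would go here. */
static bool remote_nvm_fetch(uint64_t inode, uint64_t offset, char *buf, size_t len) {
    (void)inode; (void)offset;
    snprintf(buf, len, "data from replica NVM");
    return true;                        /* hit in the intermediate cache */
}

/* Stub read from the slowest tier (e.g., SSDs). */
static void cold_storage_read(uint64_t inode, uint64_t offset, char *buf, size_t len) {
    (void)inode; (void)offset;
    snprintf(buf, len, "data from SSD tier");
}

/* A read is served from the fastest tier that holds the requested bytes. */
static enum tier cached_read(uint64_t inode, uint64_t offset, char *buf, size_t len) {
    if (local_nvm_lookup(inode, offset, buf, len))
        return LOCAL_NVM;
    if (remote_nvm_fetch(inode, offset, buf, len))
        return REMOTE_NVM;              /* byte-addressable: no page-sized fill */
    cold_storage_read(inode, offset, buf, len);
    return COLD_STORAGE;
}

int main(void) {
    char buf[64];
    enum tier t = cached_read(/*inode=*/42, /*offset=*/0, buf, sizeof buf);
    printf("served from tier %d: %s\n", (int)t, buf);
    return 0;
}

Because each lookup operates at the granularity of the requested bytes rather than fixed-size pages, a sketch like this avoids the read and write amplification the abstract attributes to disaggregated file stores.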
