Orion: A Distributed File System for Non-Volatile Main Memory and RDMA-Capable Networks

High-performance, byte-addressable non-volatile main memories (NVMMs) force system designers to rethink tradeoffs throughout the system stack, often leading to dramatic changes in system architecture. Conventional distributed file systems are a prime example. When faster NVMM replaces block-based storage, the large improvement in storage performance makes networking and software overhead a critical bottleneck. In this paper, we present Orion, a distributed file system for NVMM-based storage. By adopting a clean-slate design and leveraging the characteristics of NVMM and high-speed, RDMA-based networking, Orion provides high-performance metadata and data access while maintaining the byte addressability of NVMM. Our evaluation shows that Orion achieves performance comparable to local NVMM file systems and outperforms existing distributed file systems by a large margin.
