PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database

PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on Alibaba Cloud. PolarFS uses a lightweight network stack and I/O stack in user space, taking full advantage of emerging techniques such as RDMA, NVMe, and SPDK. As a result, the end-to-end latency of PolarFS has been reduced drastically, and our experiments show that the write latency of PolarFS is quite close to that of a local file system on SSD. To keep replicas consistent while maximizing I/O throughput for PolarFS, we develop ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the ability of databases to tolerate out-of-order I/O completion. ParallelRaft inherits the understandability and ease of implementation of Raft while providing much better I/O scalability for PolarFS. We also describe the shared storage architecture of PolarFS, which provides strong support for POLARDB.
