Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage

In-memory key-value stores that use kernel-bypass networking serve millions of operations per second per machine with microseconds of latency. They are fast in part because they are simple, but their simple interfaces force applications to move data across the network. This is inefficient for operations that aggregate over large amounts of data, and it causes delays when traversing complex data structures. Ideally, applications could push small functions to storage to avoid round trips and data movement; however, pushing code to these fast systems is challenging. Any extra complexity for interpreting or isolating code cuts into their latency and throughput benefits. We present Splinter, a low-latency key-value store that clients extend by pushing code to it. Splinter is designed for modern multi-tenant data centers; it allows mutually distrusting tenants to write their own fine-grained extensions and push them to the store at runtime. The core of Splinter's design relies on type- and memory-safe extension code to avoid conventional hardware isolation costs. This still allows for bare-metal execution, avoids data copying across trust boundaries, and makes granular storage functions that perform less than a microsecond of compute practical. Our measurements show that Splinter can process 3.5 million remote extension invocations per second with a median round-trip latency of less than 9 µs at densities of more than 1,000 tenants per server. We provide an implementation of Facebook's TAO as an 800-line extension that, when pushed to a Splinter server, improves throughput by 400 Kop/s, reaching 3.2 Mop/s over online graph data with 30 µs remote access times.
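
The round-trip argument above can be illustrated with a minimal sketch. It is not Splinter's API: `ToyStore`, `install`, `invoke`, and `local_get` are hypothetical names, the "network" is only a counter, and the sketch models round-trip savings alone, not the type- and memory-safe isolation the abstract describes. It sums a linked list stored in a key-value store, once from the client with one `get` per node and once through a function pushed to the store.

```python
from typing import Callable, Dict, Optional, Tuple

# A stored value is (payload, key of the next node or None): a toy linked list.
Node = Tuple[int, Optional[str]]


class ToyStore:
    """Hypothetical in-memory store; each client-facing call counts as one round trip."""

    def __init__(self) -> None:
        self._data: Dict[str, Node] = {}
        self._extensions: Dict[str, Callable[["ToyStore", str], int]] = {}
        self.round_trips = 0

    def put(self, key: str, value: Node) -> None:
        self.round_trips += 1
        self._data[key] = value

    def get(self, key: str) -> Node:
        self.round_trips += 1
        return self._data[key]

    def local_get(self, key: str) -> Node:
        # Called by extensions running inside the store: no network hop is counted.
        return self._data[key]

    def install(self, name: str, fn: Callable[["ToyStore", str], int]) -> None:
        self.round_trips += 1
        self._extensions[name] = fn

    def invoke(self, name: str, arg: str) -> int:
        # One round trip, however many records the extension touches.
        self.round_trips += 1
        return self._extensions[name](self, arg)


def sum_list_client(store: ToyStore, head: str) -> int:
    """Client-side traversal: one get (one round trip) per node."""
    total, key = 0, head
    while key is not None:
        value, key = store.get(key)
        total += value
    return total


def sum_list_extension(store: ToyStore, head: str) -> int:
    """Same traversal, written to run inside the store."""
    total, key = 0, head
    while key is not None:
        value, key = store.local_get(key)
        total += value
    return total


def demo() -> Tuple[int, int, int, int]:
    store = ToyStore()
    keys = ["n0", "n1", "n2", "n3"]
    for i, key in enumerate(keys):
        next_key = keys[i + 1] if i + 1 < len(keys) else None
        store.put(key, (i + 1, next_key))
    store.install("sum_list", sum_list_extension)

    store.round_trips = 0
    client_total = sum_list_client(store, "n0")
    client_trips = store.round_trips

    store.round_trips = 0
    ext_total = store.invoke("sum_list", "n0")
    ext_trips = store.round_trips
    return client_total, client_trips, ext_total, ext_trips
```

In this toy model, `demo()` yields the same sum both ways, but the client-side traversal costs one round trip per node while the pushed function costs one round trip per invocation. A real multi-tenant store would also have to prevent the pushed function from touching other tenants' data, which this sketch does not attempt.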
