RocksDB: Evolution of Development Priorities in a Key-value Store Serving Large-scale Applications

This article is an eight-year retrospective on development priorities for RocksDB, a key-value store developed at Facebook that targets large-scale distributed systems and is optimized for Solid State Drives (SSDs). We describe how the priorities evolved over time as a result of hardware trends and extensive experience running RocksDB at scale in production at a number of organizations: from optimizing write amplification, to space amplification, to CPU utilization. We describe lessons from running large-scale applications, including that resource allocation must be managed across different RocksDB instances, that data formats must remain backward- and forward-compatible to allow incremental software rollouts, and that appropriate support for database replication and backups is needed. Lessons from failure handling taught us that data corruption errors need to be detected earlier and that data integrity protection mechanisms are needed at every layer of the system. We describe improvements to the key-value interface, as well as a number of efforts that in retrospect proved misguided. Finally, we identify open problems that could benefit from future research.
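For readers unfamiliar with RocksDB, the sketch below illustrates the core key-value interface (open, put, get) that the interface improvements discussed in the article build on. It is a minimal example, not a production configuration; the database path is an illustrative placeholder.

```cpp
#include <cassert>
#include <string>

#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;

  // Open (or create) a RocksDB instance backed by a local directory.
  // "/tmp/rocksdb_example" is an arbitrary example path.
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_example", &db);
  assert(s.ok());

  // Write a key-value pair; WriteOptions control durability (e.g., sync).
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  assert(s.ok());

  // Read the value back.
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
  assert(s.ok() && value == "value1");

  delete db;
  return 0;
}
```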
