Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components

Although flash storage has largely replaced hard disks in consumer class devices, enterprise workloads pose unique challenges that have slowed adoption of flash in ``performance tier'' storage appliances. In this paper, we describe Purity, the foundation of Pure Storage's Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability. Purity borrows techniques from modern database and key-value storage architectures, and introduces novel storage primitives that have wide applicability to data management systems. For instance, all writes in Purity are monotonic, and deletions are handled using an atomic predicate-based tuple elision primitive. Purity's redundancy mechanisms are optimized for SSD failure modes and performance characteristics, allowing for fast recovery from component failures and lower space overhead than the best hard disk systems. We built deduplication and data compression schemes atop these primitives. Flash changes storage capacity/performance tradeoffs: unlike disk-based systems, flash deployments are rarely performance bound. A single Purity appliance can provide over 7GiB/s of throughput on 32KiB random I/Os, even through multiple device failures, and while providing asynchronous off-site replication. Typical installations have 99.9% latencies under 1ms, and production arrays average 5.4x data reduction and 99.999% availability. Purity takes advantage of storage performance increasing more rapidly than computational performance to build a simpler (with respect to engineering, installation, and management) scale-up storage appliance that supports hundreds of terabytes of highly-available, high-performance storage. The resulting performance and capacity supports many customer deployments of multiple applications, including scale-out and parallel systems, such as MongoDB and Oracle RAC, on a single Purity appliance.

[1]  Ethan L. Miller,et al.  The effectiveness of deduplication on virtual machine disk images , 2009, SYSTOR '09.

[2]  Butler W. Lampson,et al.  On-line data compression in a log-structured file system , 1992, ASPLOS V.

[3]  Bingsheng He,et al.  Tree Indexing on Flash Disks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[4]  Timothy Bisson,et al.  iDedup: latency-aware, inline data deduplication for primary storage , 2012, FAST.

[5]  David Hung-Chang Du,et al.  Deferred updates for flash-based storage , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[6]  Dutch T. Meyer,et al.  Strata: High-Performance Scalable Storage on Virtualized Non-volatile Memory , 2014, FAST 2014.

[7]  Hans-Arno Jacobsen,et al.  PNUTS: Yahoo!'s hosted data serving platform , 2008, Proc. VLDB Endow..

[8]  Balakrishna R. Iyer,et al.  Data Compression Support in Databases , 1994, VLDB.

[9]  Michael A. Bender,et al.  Cache-oblivious streaming B-trees , 2007, SPAA '07.

[10]  Steven Swanson,et al.  The bleak future of NAND flash memory , 2012, FAST.

[11]  Dutch T. Meyer,et al.  Strata: scalable high-performance storage on virtualized non-volatile memory , 2014, FAST.

[12]  Dennis Shasha,et al.  The dangers of replication and a solution , 1996, SIGMOD '96.

[13]  Yang Wang,et al.  Robustness in the Salus Scalable Block Store , 2013, NSDI.

[14]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[15]  Ryan Johnson,et al.  Evaluating and repairing write performance on flash devices , 2009, DaMoN '09.

[16]  Suman Nath,et al.  Cheap and Large CAMs for High Performance Data-Intensive Networked Systems , 2010, NSDI.

[17]  Patrick E. O'Neil,et al.  The log-structured merge-tree (LSM-tree) , 1996, Acta Informatica.

[18]  Chris Jermaine,et al.  The partitioned exponential file for database storage management , 2007, The VLDB Journal.

[19]  Eric A. Brewer,et al.  Rose: compressed, log-structured replication , 2008, Proc. VLDB Endow..

[20]  Jeffrey Dean,et al.  Designs, Lessons and Advice from Building Large Distributed Systems , 2009 .

[21]  Sang-Won Lee,et al.  SFS: random write considered harmful in solid state drives , 2012, FAST.

[22]  Jin Li,et al.  FlashStore: High Throughput Persistent Key-Value Store , 2010, Proc. VLDB Endow..

[23]  Ethan L. Miller,et al.  Screaming fast Galois field arithmetic using intel SIMD instructions , 2013, FAST.

[24]  Jin Li,et al.  ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory , 2010, USENIX Annual Technical Conference.

[25]  E. L. Miller,et al.  Building Flexible , Fault-Tolerant Flash-based Storage Systems , 2009 .

[26]  Dahlia Malkhi,et al.  CORFU: A Shared Log Design for Flash Clusters , 2012, NSDI.

[27]  Tao Zou,et al.  Tango: distributed data structures over a shared log , 2013, SOSP.

[28]  Eric A. Brewer,et al.  Lessons from Giant-Scale Services , 2001, IEEE Internet Comput..

[29]  Joseph M. Hellerstein,et al.  Consistency Analysis in Bloom: a CALM and Collected Approach , 2011, CIDR.

[30]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[31]  Jin Li,et al.  SkimpyStash: RAM space skimpy key-value store on flash-based storage , 2011, SIGMOD '11.

[32]  Himanshu Yadava,et al.  The Berkeley DB Book , 2007 .

[33]  Stephen M. Rumble,et al.  Log-structured memory for DRAM-based storage , 2014, FAST.

[34]  Irving L. Traiger,et al.  Granularity of Locks and Degrees of Consistency in a Shared Data Base , 1998, IFIP Working Conference on Modelling in Data Base Management Systems.

[35]  Amar Phanishayee,et al.  FAWN: a fast array of wimpy nodes , 2009, SOSP '09.

[36]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[37]  Tian Luo,et al.  CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives , 2011, FAST.

[38]  Raghu Ramakrishnan,et al.  bLSM: a general purpose log structured merge tree , 2012, SIGMOD Conference.

[39]  D. Andersen,et al.  A Fast Array of Wimpy Nodes , 2008 .

[40]  Bin Fan,et al.  SILT: a memory-efficient, high-performance key-value store , 2011, SOSP.

[41]  David J. DeWitt,et al.  How to barter bits for chronons: compression and bandwidth trade offs for database scans , 2007, SIGMOD '07.

[42]  Steven Swanson,et al.  The Harey Tortoise: Managing Heterogeneous Write Performance in SSDs , 2013, USENIX Annual Technical Conference.

[43]  Ethan L. Miller,et al.  Adding aggressive error correction to a high-performance compressing flash file system , 2009, EMSOFT '09.

[44]  David Flynn,et al.  DFS: A file system for virtualized flash storage , 2010, TOS.

[45]  Ramesh K. Sitaraman,et al.  Lazy-Adaptive Tree: An Optimized Index Structure for Flash Devices , 2009, Proc. VLDB Endow..

[46]  Goetz Graefe,et al.  The five-minute rule ten years later, and other computer storage rules of thumb , 1997, SGMD.

[47]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[48]  Jan Van den Bussche,et al.  Positive Dedalus programs tolerate non-causality , 2014, J. Comput. Syst. Sci..

[49]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[50]  Luiz André Barroso,et al.  The tail at scale , 2013, CACM.

[51]  Dutch T. Meyer,et al.  A study of practical deduplication , 2011, TOS.

[52]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[53]  Wolfgang Lehner,et al.  Efficient transaction processing in SAP HANA database: the end of a column store myth , 2012, SIGMOD Conference.

[54]  Anand Sivasubramaniam,et al.  Leveraging Value Locality in Optimizing NAND Flash-based SSDs , 2011, FAST.