Isotope: Transactional Isolation for Block Storage

Existing storage stacks are top-heavy and expect little from block storage. As a result, new high-level storage abstractions (and new designs for existing abstractions) are difficult to realize: developers must implement complex functionality such as failure atomicity and fine-grained concurrency control from scratch. In this paper, we argue that pushing transactional isolation into the block store (in addition to atomicity and durability) is both viable and broadly useful, resulting in simpler high-level storage systems that provide strong semantics without sacrificing performance. We present Isotope, a new block store that supports ACID transactions over block reads and writes. Internally, Isotope uses a new multiversion concurrency control protocol that exploits fine-grained, sub-block parallelism in workloads and offers both strict serializability and snapshot isolation guarantees. We implemented several high-level storage systems over Isotope, including a POSIX filesystem and two key-value stores that implement the LevelDB API over a hashtable and a B-tree, respectively. We show that Isotope's block-level transactions enable systems that are simple (hundreds of lines of code), robust (i.e., providing ACID guarantees), and fast (e.g., 415 MB/s for random file writes). We also show that these systems can be composed using Isotope, providing applications with transactions across different high-level constructs such as files, directories, and key-value pairs.
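To make the idea of transactions over block reads and writes concrete, the following is a minimal in-memory sketch. It is not Isotope's API or its multiversion protocol: the names `ToyBlockStore`, `Transaction`, `begin`, `read`, `write`, `commit`, and `TxAborted` are all hypothetical, and the concurrency control shown is plain single-version optimistic validation at commit time, chosen only because it fits in a few lines.

```python
import threading

BLOCK_SIZE = 4096  # assumed block size for this toy example


class TxAborted(Exception):
    """Raised when commit-time validation detects a conflicting write."""


class ToyBlockStore:
    """Hypothetical in-memory block store; not the Isotope implementation."""

    def __init__(self, num_blocks):
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self._versions = [0] * num_blocks  # commit timestamp of last write
        self._clock = 0                    # global commit counter
        self._lock = threading.Lock()

    def begin(self):
        # A transaction remembers the commit counter at its start.
        with self._lock:
            return Transaction(self, self._clock)


class Transaction:
    def __init__(self, store, start_ts):
        self._store = store
        self._start_ts = start_ts
        self._read_set = set()  # block addresses read from the store
        self._writes = {}       # buffered writes: address -> data

    def read(self, addr):
        # Reads see the transaction's own buffered writes first.
        if addr in self._writes:
            return self._writes[addr]
        self._read_set.add(addr)
        with self._store._lock:
            return self._store._blocks[addr]

    def write(self, addr, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must cover exactly one block")
        self._writes[addr] = bytes(data)  # buffered until commit

    def commit(self):
        store = self._store
        with store._lock:
            # Validate: abort if any block we read was overwritten by a
            # transaction that committed after we started.
            for addr in self._read_set:
                if store._versions[addr] > self._start_ts:
                    raise TxAborted(f"block {addr} changed since start")
            # Apply all buffered writes atomically under one timestamp.
            store._clock += 1
            for addr, data in self._writes.items():
                store._blocks[addr] = data
                store._versions[addr] = store._clock
```

In this sketch, two transactions that read the same block and then write conflict only if one commits a write to that block before the other validates; the loser raises `TxAborted` and can be retried by the caller. The toy omits durability and tracks conflicts only at whole-block granularity, whereas the abstract describes Isotope as durable and as exploiting sub-block parallelism.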
