Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data

Preserving the integrity of application data across updates is difficult if power outages and system crashes may occur during updates. Existing approaches such as relational databases and transactional key-value stores restrict programming flexibility by mandating narrow data access interfaces. We have designed, implemented, and evaluated an approach that strengthens the semantics of a standard operating system primitive while maintaining conceptual simplicity and supporting highly flexible programming: failure-atomic msync() commits changes to a memory-mapped file atomically, even in the presence of failures. Our Linux implementation of failure-atomic msync() has preserved application data integrity across hundreds of whole-machine power interruptions and exhibits good microbenchmark performance on both spinning disks and solid-state storage. Failure-atomic msync() supports higher layers of fully general programming abstraction, e.g., a persistent heap that easily slips beneath the C++ Standard Template Library. An STL <map> built atop failure-atomic msync() outperforms several local key-value stores that support transactional updates. We integrated failure-atomic msync() into the Kyoto Tycoon key-value server by modifying exactly one line of code; our modified server reduces response times by 26-43% compared to Tycoon's existing transaction support while providing the same data integrity guarantees. Compared to a Tycoon server configuration that performs almost no I/O (and therefore provides no data durability or integrity across failures), failure-atomic msync() incurs a three-fold response time increase on a fast Flash-based SSD, an acceptable cost of data reliability for many applications.
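To make the programming model concrete, the following is a minimal sketch of the mmap()/msync() pattern that the abstract describes, written against the standard POSIX calls only. The `struct accounts` record, the function names, and the file layout are hypothetical and chosen for illustration. On a stock Linux kernel, msync() with MS_SYNC flushes dirty pages but does not guarantee that they reach storage all-or-nothing; that guarantee is what the paper's modified implementation adds, and how an application opts into it is not stated in this abstract.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record: two balances whose sum must stay constant. */
struct accounts {
    long a;
    long b;
};

/* Create (or truncate) the backing file and store the initial balances.
 * Returns 0 on success, -1 on error. */
int accounts_init(const char *path, long a, long b)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    struct accounts init = { a, b };
    ssize_t n = write(fd, &init, sizeof init);
    int rc = (n == (ssize_t)sizeof init && fsync(fd) == 0) ? 0 : -1;
    close(fd);
    return rc;
}

/* Move `amount` from a to b through a shared memory mapping, then commit
 * with msync(). With ordinary msync() a crash during the flush could, in
 * general, leave a partially written state in the file; the mechanism
 * described in the abstract is meant to make this commit all-or-nothing.
 * Returns 0 on success, -1 on error. */
int accounts_transfer(const char *path, long amount)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    struct accounts *acc = mmap(NULL, sizeof *acc, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (acc == MAP_FAILED)
        return -1;

    /* Plain in-memory updates: no special data access interface needed. */
    acc->a -= amount;
    acc->b += amount;

    /* Commit point: synchronously flush the modified page to the file. */
    int rc = msync(acc, sizeof *acc, MS_SYNC);
    munmap(acc, sizeof *acc);
    return rc;
}

/* Read the balances back through the regular file interface.
 * Returns 0 on success, -1 on error. */
int accounts_read(const char *path, struct accounts *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, out, sizeof *out);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

A caller would run `accounts_init(path, 100, 50)` once, then `accounts_transfer(path, 30)` for each update. The point of the sketch is that the application mutates ordinary in-memory data and names a single commit point; the invariant (here, a constant sum of the two balances) only survives a crash if that commit is failure-atomic.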
