Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3

This paper reports our experience applying lightweight formal methods to validate the correctness of ShardStore, a new key-value storage node implementation for the Amazon S3 cloud object storage service. By "lightweight formal methods" we mean a pragmatic approach to verifying the correctness of a production storage node that is under ongoing feature development by a full-time engineering team. We do not aim to achieve full formal verification, but instead emphasize automation, usability, and the ability to continually ensure correctness as both software and its specification evolve over time. Our approach decomposes correctness into independent properties, each checked by the most appropriate tool, and develops executable reference models as specifications to be checked against the implementation. Our work has prevented 16 issues from reaching production, including subtle crash consistency and concurrency problems, and has been extended by non-formal-methods experts to check new features and properties as ShardStore has evolved.
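The idea of using an executable reference model as a specification can be illustrated with a minimal sketch. This is not ShardStore code: `ReferenceModel`, `LogStore`, `random_ops`, and `check_conformance` are hypothetical names chosen for illustration, and the "implementation" is a toy append-only log. The sketch applies random operation sequences to both the model and the implementation and reports the first sequence on which their observable results differ.

```python
import random


class ReferenceModel:
    """Executable specification (hypothetical): a plain in-memory map."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)


class LogStore:
    """Toy implementation under test (hypothetical): append-only log plus index."""

    def __init__(self):
        self.log = []    # (key, value) records; value None marks a delete
        self.index = {}  # key -> position of the latest record for that key

    def put(self, key, value):
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def delete(self, key):
        # Deletes are recorded as tombstones rather than removing data.
        self.log.append((key, None))
        self.index[key] = len(self.log) - 1


def random_ops(rng, length):
    """Generate a random operation sequence over a small key space."""
    ops = []
    for _ in range(length):
        kind = rng.choice(["put", "get", "delete"])
        key = rng.randrange(4)  # few keys, so operations collide often
        value = rng.randrange(100) if kind == "put" else None
        ops.append((kind, key, value))
    return ops


def check_conformance(make_impl, seed, runs=200, length=30):
    """Return the prefix of the first diverging sequence, or None if all runs agree."""
    rng = random.Random(seed)
    for _ in range(runs):
        ops = random_ops(rng, length)
        model, impl = ReferenceModel(), make_impl()
        for i, (kind, key, value) in enumerate(ops):
            if kind == "put":
                model.put(key, value)
                impl.put(key, value)
            elif kind == "delete":
                model.delete(key)
                impl.delete(key)
            elif model.get(key) != impl.get(key):
                return ops[: i + 1]
    return None
```

Calling `check_conformance(LogStore, seed=0)` returns `None` when every generated sequence agrees with the model. Because the model is ordinary code, it can evolve alongside the implementation, which is the property the lightweight approach relies on. A real harness would additionally minimize failing sequences and inject crashes or I/O failures, which this sketch omits.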
