A fault-tolerance shim for serverless computing

Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, because a failure partway through a request can unwittingly persist a partial set of updates. To address this challenge, we would like atomic visibility of the updates made by a FaaS application. In this paper, we present aft, an atomic fault tolerance shim for serverless applications. aft interposes between a commodity FaaS platform and a storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. aft supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that aft introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.
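
To make the partial-update problem concrete, the following is a minimal Python sketch of atomic visibility, not aft's actual protocol. It assumes a hypothetical in-memory shim (`Shim`, `Txn`, `read`, and the `fail_after` crash hook are all illustrative names) that buffers a request's writes, stores them under versioned keys, and makes them visible only once a single commit record has been appended. It illustrates only the all-or-nothing visibility of writes; full read atomic isolation also constrains which versions a single request may read together.

```python
import uuid


class Shim:
    """Hypothetical stand-in for a shim between functions and storage."""

    def __init__(self):
        self.storage = {}   # (key, txid) -> value; versions are never overwritten
        self.commits = []   # ordered commit records: (txid, set of keys written)

    def begin(self):
        return Txn(self)


class Txn:
    """Buffers writes locally until commit."""

    def __init__(self, shim):
        self.shim = shim
        self.id = uuid.uuid4().hex
        self.buffer = {}

    def put(self, key, value):
        self.buffer[key] = value

    def commit(self, fail_after=None):
        # Write every versioned key first. A crash here leaves orphaned
        # versions in storage, but no reader can see them.
        for i, (key, value) in enumerate(self.buffer.items()):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated crash during commit")
            self.shim.storage[(key, self.id)] = value
        # The commit record is written last; it is the single step that
        # makes all of this transaction's updates visible together.
        self.shim.commits.append((self.id, frozenset(self.buffer)))


def read(shim, key):
    """Return the value from the most recent committed transaction that wrote key."""
    for txid, keys in reversed(shim.commits):
        if key in keys:
            return shim.storage[(key, txid)]
    return None
```

Under these assumptions, a request that crashes after writing only some of its keys leaves no visible trace, so a platform-level retry can simply run the request again in a fresh transaction.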
