Fault-tolerant and Transactional Stateful Serverless Workflows (extended version)

This paper introduces Beldi, a library and runtime system for writing and composing fault-tolerant and transactional stateful serverless functions. Beldi runs on existing providers and lets developers write complex stateful applications that require fault tolerance and transactional semantics without the need to deal with tasks such as load balancing or maintaining virtual machines. Beldi's contributions include extending the log-based fault-tolerant approach in Olive (OSDI 2016) with new data structures, transaction protocols, function invocations, and garbage collection. They also include adapting the resulting framework to work over a federated environment where each serverless function has sovereignty over its own data. We implement three applications on Beldi, including a movie review service, a travel reservation system, and a social media site. Our evaluation on 1,000 AWS Lambdas shows that Beldi's approach is effective and affordable.

[1]  Yujie Liu,et al.  Practical Non-blocking Unordered Lists , 2013, DISC.

[2]  E. B. Moss,et al.  Nested Transactions: An Approach to Reliable Distributed Computing , 1985 .

[3]  Jim Gray,et al.  Notes on Data Base Operating Systems , 1978, Advanced Course: Operating Systems.

[4]  Hamid Pirahesh,et al.  ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging , 1998 .

[5]  John D. Valois Lock-free linked lists using compare-and-swap , 1995, PODC '95.

[6]  Ali Ghodsi,et al.  Scalable atomic visibility with RAMP transactions , 2014, SIGMOD Conference.

[7]  Philip A. Bernstein,et al.  Formal Aspects of Serializability in Database Concurrency Control , 1979, IEEE Transactions on Software Engineering.

[8]  Dennis Shasha,et al.  Deferred Runtime Pipelining for contentious multicore software transactions , 2019, EuroSys.

[9]  Eddie Kohler,et al.  Type-aware transactions for faster concurrent code , 2016, EuroSys.

[10]  Joseph M. Hellerstein,et al.  Cloudburst , 2020, Proc. VLDB Endow..

[11]  Yuan He,et al.  An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems , 2019, ASPLOS.

[12]  Michael F. Spear,et al.  Conflict Detection and Validation Strategies for Software Transactional Memory , 2006, DISC.

[13]  Mengyuan Li,et al.  Peeking Behind the Curtains of Serverless Platforms , 2018, USENIX Annual Technical Conference.

[14]  Rachid Guerraoui,et al.  On the correctness of transactional memory , 2008, PPoPP.

[15]  Arvind Krishnamurthy,et al.  Building consistent transactions with inconsistent replication , 2015, SOSP.

[16]  Christoforos E. Kozyrakis,et al.  From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers , 2019, USENIX Annual Technical Conference.

[17]  Divyakant Agrawal,et al.  Low-Latency Multi-Datacenter Databases using Replicated Commit , 2013, Proc. VLDB Endow..

[18]  Dushyanth Narayanan,et al.  Fast General Distributed Transactions with Opacity , 2019, SIGMOD Conference.

[19]  Christos H. Papadimitriou,et al.  The serializability of concurrent database updates , 1979, JACM.

[20]  Christopher Frost,et al.  Spanner: Google's Globally-Distributed Database , 2012, OSDI.

[21]  Marcos K. Aguilera,et al.  No Time for Asynchrony , 2009, HotOS.

[22]  Christoforos E. Kozyrakis,et al.  Pocket: Elastic Ephemeral Storage for Serverless Analytics , 2018, OSDI.

[23]  Jinyang Li,et al.  Consolidating Concurrency Control and Consensus for Commits under Conflicts , 2016, OSDI.

[24]  Frank Dabek,et al.  Large-scale Incremental Processing Using Distributed Transactions and Notifications , 2010, OSDI.

[25]  Joseph M. Hellerstein,et al.  Serverless Computing: One Step Forward, Two Steps Back , 2018, CIDR.

[26]  J. T. Robinson,et al.  On optimistic methods for concurrency control , 1979, TODS.

[27]  Yuriy Brun,et al.  Formal foundations of serverless computing , 2019, Proc. ACM Program. Lang..

[28]  Joseph E. Gonzalez,et al.  A fault-tolerance shim for serverless computing , 2020, EuroSys.

[29]  Timothy L. Harris,et al.  A Pragmatic Implementation of Non-blocking Linked-Lists , 2001, DISC.

[30]  Marcos K. Aguilera,et al.  Detecting failures in distributed systems with the Falcon spy network , 2011, SOSP.

[31]  Srinath T. V. Setty,et al.  Realizing the Fault-Tolerance Promise of Cloud Storage Using Locks with Intent , 2016, OSDI.