Canary: Fault-Tolerant FaaS for Stateful Time-Sensitive Applications

Function-as-a-Service (FaaS) platforms have rapidly gained popularity. Many stateful applications have been migrated to FaaS platforms for their ease of deployment, scalability, and minimal management overhead. However, failures in FaaS have not been thoroughly investigated, which makes these otherwise attractive platforms unreliable for guaranteeing function execution and meeting performance requirements. In this paper, we propose Canary, a highly resilient and fault-tolerant framework for FaaS that mitigates the impact of failures and reduces the overhead of function restart. Canary uses replicated container runtimes and application-level checkpoints to reduce application recovery time on FaaS platforms. Our evaluations using representative stateful FaaS applications show that Canary reduces application recovery time and dollar cost by up to 83% and 12%, respectively, compared with the default retry-based strategy. Moreover, it improves application availability with an additional average execution time and cost overhead of 14% and 8%, respectively, compared with the ideal failure-free execution.
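To illustrate the general idea behind application-level checkpointing versus a retry-based strategy, the following is a minimal sketch, not Canary's implementation. It assumes a toy stateful function that sums a list, and an in-memory dictionary standing in for a durable checkpoint store. All names (`run`, `run_with_recovery`, `InjectedFailure`, the `ckpt` key) are hypothetical.

```python
import json


class InjectedFailure(Exception):
    """Simulated container crash (hypothetical, for illustration only)."""


def run(items, store, fail_at=None, checkpoint_every=2, use_checkpoint=True):
    """Sum `items`, optionally resuming from a checkpoint saved in `store`.

    Returns (total, steps_executed_in_this_attempt).
    """
    state = {"next": 0, "total": 0}
    if use_checkpoint and "ckpt" in store:
        # Resume from the last saved application state.
        state = json.loads(store["ckpt"])
    steps = 0
    for i in range(state["next"], len(items)):
        if fail_at is not None and i == fail_at:
            raise InjectedFailure(f"crash before item {i}")
        state["total"] += items[i]
        state["next"] = i + 1
        steps += 1
        # Persist state periodically; `store` is an assumed stand-in
        # for durable storage that survives the crash.
        if use_checkpoint and state["next"] % checkpoint_every == 0:
            store["ckpt"] = json.dumps(state)
    return state["total"], steps


def run_with_recovery(items, fail_at, use_checkpoint):
    """Run once with an injected failure, then re-run to completion."""
    store = {}
    try:
        run(items, store, fail_at=fail_at, use_checkpoint=use_checkpoint)
    except InjectedFailure:
        pass  # the platform would restart the function here
    return run(items, store, use_checkpoint=use_checkpoint)
```

In this sketch, the second element of the value returned by `run_with_recovery` counts the steps re-executed after the failure. With checkpoints, only the work since the last checkpoint is repeated, whereas a plain retry repeats everything.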
