Canary: Fault-Tolerant FaaS for Stateful Time-Sensitive Applications
[1] Sudharshan S. Vazhkudai,et al. Exploiting CXL-based Memory for Distributed Deep Learning , 2022, ICPP.
[2] Dimitrios S. Nikolopoulos,et al. On Realizing Efficient Deep Learning Using Serverless Computing , 2022, 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid).
[3] Marc Sánchez Artigas,et al. Stateful Serverless Computing with Crucial , 2022, ACM Trans. Softw. Eng. Methodol.
[4] Leonid Ryzhyk,et al. Cloud-Scale Runtime Verification of Serverless Applications , 2021, SoCC.
[5] Samuel Williams,et al. Architectural Requirements for Deep Learning Workloads in HPC Environments , 2021, 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[6] Emmett Witchel,et al. Boki: Stateful Serverless Computing with Shared Logs , 2021, SOSP.
[7] Asterios Katsifodimos,et al. Distributed transactions on serverless stateful functions , 2021, DEBS.
[8] Mark Szalay,et al. Predicting cloud-native application failures based on monitoring data of cloud infrastructure , 2021, 2021 IFIP/IEEE International Symposium on Integrated Network Management (IM).
[9] Rekha Singhal,et al. High Performance Serverless Architecture for Deep Learning Workflows , 2021, 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid).
[10] M. Muthukannan,et al. Self-healing Fault Tolerance Technique in Cloud Datacenter , 2021, 2021 6th International Conference on Inventive Computation Technologies (ICICT).
[11] T. Hoefler,et al. SeBS: a serverless benchmark suite for function-as-a-service computing , 2020, Middleware.
[12] Nikos Parlavantzas,et al. Active-Standby for High-Availability in FaaS , 2020, WOSC@Middleware.
[13] Michael J. Freedman,et al. Serverless Isn't Server-Less: Measuring and Exploiting Resource Variability on Cloud FaaS Platforms , 2020, WOSC@Middleware.
[14] Daniel Fireman,et al. Prebaking Functions to Warm the Serverless Cold Start , 2020, Middleware.
[15] M. Mustafa Rafique,et al. Infrastructure-Aware TensorFlow for Heterogeneous Datacenters , 2020, 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[16] Wen Zhang,et al. Kappa: a programming framework for serverless computing , 2020, SoCC.
[17] Rekha Singhal,et al. Migrating Large Deep Learning Models to Serverless Architecture , 2020, 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW).
[18] Joseph M. Hellerstein,et al. A FaaS File System for Serverless Computing , 2020, ArXiv.
[19] Osman Unsal,et al. Checkpoint Restart Support for Heterogeneous HPC Applications , 2020, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).
[20] Joseph E. Gonzalez,et al. A fault-tolerance shim for serverless computing , 2020, EuroSys.
[21] Peter Pietzuch,et al. Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing , 2020, USENIX Annual Technical Conference.
[22] Joseph M. Hellerstein,et al. Cloudburst , 2020, Proc. VLDB Endow.
[23] Marc Sánchez Artigas,et al. On the FaaS Track: Building Stateful Distributed Applications with Serverless Architectures , 2019, Middleware.
[24] Steven Swanson,et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory , 2019, FAST.
[25] Lei Huang,et al. Performant Container Support for HPC Applications , 2019, PEARC.
[26] Guyue Liu,et al. Living on the Edge: Serverless Computing and the Cost of Failure Resiliency , 2019, 2019 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN).
[27] Leonardo Bautista-Gomez,et al. Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets , 2019, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[28] Henning Schulzrinne,et al. Checkpointing and Migration of IoT Edge Functions , 2019, EdgeSys@EuroSys.
[29] Xiao Liu,et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module , 2019, ArXiv.
[30] David Jackson,et al. An Investigation of the Impact of Language Runtime on the Performance and Cost of Serverless Functions , 2018, 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion).
[31] Nirmeen A. El-Bahnasawy,et al. On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems , 2018, J. Ambient Intell. Humaniz. Comput.
[32] Rami G. Melhem,et al. Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[33] Rajkumar Buyya,et al. Using Proactive Fault-Tolerance Approach to Enhance Cloud Service Reliability , 2018, IEEE Transactions on Cloud Computing.
[34] Geoffrey C. Fox,et al. Evaluation of Production Serverless Computing Environments , 2018, 2018 IEEE 11th International Conference on Cloud Computing (CLOUD).
[35] Gerald Kotonya,et al. A Microservices Architecture for Reactive and Proactive Fault Tolerance in IoT Systems , 2018, 2018 IEEE 19th International Symposium on "A World of Wireless, Mobile and Multimedia Networks" (WoWMoM).
[36] Joseph M. Hellerstein,et al. Anna: A KVS for Any Scale , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).
[37] Turgay Celik,et al. Toward a Smart Cloud: A Review of Fault-Tolerance Methods in Cloud Systems , 2018, IEEE Transactions on Services Computing.
[38] Sathya Chinnathambi,et al. Scheduling and checkpointing optimization algorithm for Byzantine fault tolerance in cloud clusters , 2018, Cluster Computing.
[39] Mohamed Elkawkagy,et al. A reactive fault tolerance approach for cloud computing , 2017, 2017 13th International Computer Engineering Conference (ICENCO).
[40] Kevin T. Pedretti,et al. A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds , 2017, 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom).
[41] Vatche Ishakian,et al. Serving Deep Learning Models in a Serverless Platform , 2017, 2018 IEEE International Conference on Cloud Engineering (IC2E).
[42] Brendan Burns,et al. Kubernetes: Up and Running: Dive into the Future of Infrastructure , 2017.
[43] Wei Xu,et al. What Can We Learn from Four Years of Data Center Hardware Failures? , 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[44] Pavel Tariqul Islam,et al. Predicting Application Failure in Cloud: A Machine Learning Approach , 2017, 2017 IEEE International Conference on Cognitive Computing (ICCC).
[45] Omer Subasi,et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[46] Keun Soo Yim,et al. Evaluation Metrics of Service-Level Reliability Monitoring Rules of a Big Data Service , 2016, 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE).
[47] N. Mansouri. Adaptive data replication strategy in cloud computing for performance improvement , 2016, Frontiers of Computer Science.
[48] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.
[49] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Salwa M. Nassar,et al. Fault tolerance in cloud computing - survey , 2015, 2015 11th International Computer Engineering Conference (ICENCO).
[51] Changhai Zhao,et al. Event-Driven Fault Tolerance for Building Nonstop Active Message Programs , 2013, 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing.
[52] Navendu Jain,et al. Demystifying the dark side of the middle: a field study of middlebox failures in datacenters , 2013, Internet Measurement Conference.
[53] Bran Selic,et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems , 2013, The Journal of Supercomputing.
[54] Bianca Schroeder,et al. Reading between the lines of failure logs: Understanding how HPC systems fail , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[55] Heinz W. Schmidt,et al. Parameterised architectural patterns for providing cloud service fault tolerance with accurate costings , 2013, CBSE '13.
[56] Gurpreet Singh,et al. Fault Tolerance Techniques and Comparative Implementation in Cloud Computing , 2013.
[57] Avishay Traeger,et al. To Zip or not to Zip: effective resource usage for real-time compression , 2013, FAST.
[58] Navendu Jain,et al. Understanding network failures in data centers: measurement, analysis, and implications , 2011, SIGCOMM.
[59] Kashi Venkatesh Vishwanath,et al. Characterizing cloud computing hardware reliability , 2010, SoCC '10.
[60] J. Chris Anderson,et al. CouchDB - The Definitive Guide: Time to Relax , 2010.
[61] Radu Prodan,et al. A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact , 2009, 2009 Fifth IEEE International Conference on e-Science.
[62] Christian Engelmann,et al. Proactive Fault Tolerance Using Preemptive Migration , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.
[63] Sanjay Ghemawat,et al. MapReduce: simplified data processing on large clusters , 2008, CACM.
[64] Thomas Haynes,et al. Network File System (NFS) Version 4 Protocol , 2003, RFC.
[65] Evangelos P. Markatos,et al. The Network RamDisk: Using remote memory on heterogeneous NOWs , 1999, Cluster Computing.
[66] C. Morin,et al. Request Replication for FaaS Fault Tolerance , 2023.
[67] Daniel C. Stanzione,et al. Lessons Learned from the Chameleon Testbed , 2020, USENIX Annual Technical Conference.
[68] Vincent Liu,et al. Fault-tolerant and transactional stateful serverless workflows , 2020, OSDI.
[69] Guigang Zhang,et al. Deep Learning , 2016, Int. J. Semantic Comput.
[70] Christian Engelmann,et al. Redundant Execution of HPC Applications with MR-MPI , 2011.
[71] Laxmikant V. Kale,et al. Proactive Fault Tolerance in Large Systems , 2004.