Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput

Large-scale applications rely on resilience mechanisms such as checkpoint-restart to make forward progress in the presence of failures. Unfortunately, checkpointing incurs substantial I/O overhead and reduces productivity. To mitigate this challenge, this paper introduces a new technique, Shiraz, which demonstrates how to exploit differences in checkpointing overhead among applications, together with knowledge of the temporal characteristics of failures, to improve both overall system throughput and the performance of individual applications.
