Optimum Checkpoints for Time and Energy

We study programs that operate in the presence of possible failures and that must be restarted from the beginning after each failure. In such systems, checkpoints are introduced to reduce the large cost of program restarts when failures occur. Here we suggest that checkpoints should be introduced in a manner that assures effective reliability while reducing both the computational overhead and the energy consumed as much as possible. We compute the total average program execution time in the presence of checkpoints, which limit re-execution after a failure to the part of the program that follows the most recent checkpoint. We also study the total energy consumption of the program under the same conditions, and formulate an optimization problem to minimize a weighted sum of the average computation time and the energy. This approach is placed in the context of Application Level Checkpointing and Restart (ALCR). We then focus on checkpoints placed at the beginning of a loop, and derive the optimum placement of checkpoints that minimizes a weighted combination of the program's execution time and energy consumption. Numerical results are presented to illustrate the analysis. Finally, we describe a software tool with a graphical interface designed to assist a system designer in choosing the optimum checkpoint for a given program as a function of failure rates and other parameters.
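
To make the weighted time and energy trade-off concrete, the following is a minimal sketch under assumptions that are not taken from the paper: failures arrive at a constant rate `lam`, each checkpoint costs `C` seconds, each restart costs `R` seconds, and the expected loss per failure is approximated to first order as half a segment plus the restart. The power levels `p_work`, `p_ckpt`, `p_restart`, the weights `a` and `b`, and both function names are hypothetical. The paper's own model, which places checkpoints at the beginning of a loop, may differ.

```python
import math


def cost_per_unit_work(T, lam, C, R, p_work, p_ckpt, p_restart, a, b):
    """Weighted time/energy cost per second of useful work (first-order model).

    T is the useful work between checkpoints; all parameters are assumptions
    made for illustration, not quantities defined in the paper.
    """
    seg = T + C                    # failure-free length of one segment
    failures = lam * seg           # expected failures per segment (first order)
    # Each failure is assumed to lose half a segment on average, plus a restart.
    time = seg + failures * (seg / 2 + R)
    energy = (p_work * T + p_ckpt * C
              + failures * (p_work * seg / 2 + p_restart * R))
    return (a * time + b * energy) / T


def optimum_interval(lam, C, R, p_work, p_ckpt, p_restart, a, b,
                     lo=1.0, hi=1e6, tol=1e-2):
    """Golden-section search for the interval T minimizing the weighted cost.

    The cost has the form K + A/T + B*T with A, B > 0, so it is unimodal on T > 0.
    """
    inv_phi = (math.sqrt(5) - 1) / 2

    def f(T):
        return cost_per_unit_work(T, lam, C, R, p_work, p_ckpt, p_restart, a, b)

    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:                # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
        else:                      # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
    return (lo + hi) / 2


if __name__ == "__main__":
    # Hypothetical parameters: one failure per 1e5 s, 60 s checkpoint, 120 s restart.
    params = dict(lam=1e-5, C=60.0, R=120.0,
                  p_work=100.0, p_ckpt=200.0, p_restart=100.0)
    print("time only  :", optimum_interval(a=1.0, b=0.0, **params))
    print("energy only:", optimum_interval(a=0.0, b=1.0, **params))
    print("weighted   :", optimum_interval(a=1.0, b=0.01, **params))
```

With `a=1` and `b=0`, the minimizer of this simplified cost is `sqrt(2*C/lam + C**2 + 2*R*C)`, which approaches the classical first-order approximation `sqrt(2*C/lam)` when `C` and `R` are small relative to the mean time between failures. In this sketch, making checkpoints draw more power than computation pushes the energy-optimal interval above the time-optimal one, so the weights `a` and `b` select a compromise between the two.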
