The impact of checkpointing interval selection on the scheduling performance of real‐time fine‐grained parallel applications in SaaS clouds under various failure probabilities

As the adoption of Software as a Service (SaaS) cloud computing continues to gain momentum, the challenges of scheduling parallel applications on such platforms need to be addressed. Due to the complexity and fine‐grained parallelism of the workload, as well as the multi‐tenancy of the underlying host environment, end‐user applications are prone to transient software failures. Fault tolerance is therefore one of the most crucial aspects of scheduling in SaaS clouds, and it is usually achieved through application‐directed checkpointing. However, selecting an appropriate checkpointing interval is not a trivial task. Unnecessarily frequent checkpointing adds overhead that may degrade system performance, whereas infrequent checkpointing increases the work that must be re-executed after a failure, lengthening recovery time and likewise degrading performance. Consequently, the checkpointing interval must be selected taking into account both the failure probability and the nature of the workload. To this end, we investigate via simulation the impact of checkpointing interval selection on the performance of a SaaS cloud in which fine‐grained parallel applications with firm deadlines and approximate computations are scheduled for execution under various failure probabilities. The simulation results are analyzed to shed light on the relationship between the checkpointing interval and the failure probability.
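The trade-off between checkpointing too often and too rarely can be illustrated with Young's classical first-order approximation of the optimum checkpoint interval, sqrt(2 · C · MTBF). The sketch below is not the simulation model studied in this work; it is a minimal illustration that assumes a constant checkpoint cost `checkpoint_cost` (C), a mean time between failures `mtbf`, and failures that are rare relative to the interval. The function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimum interval: sqrt(2 * C * MTBF).

    Assumes (for illustration only) a fixed checkpoint cost and an MTBF that is
    large compared with the checkpoint cost.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order expected overhead per unit of useful work.

    C / T        : time spent writing checkpoints
    T / (2*MTBF) : expected rework, since a failure loses half an interval on average
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 100.0  # hypothetical values, in the same time unit
    best = young_interval(cost, mtbf)
    for t in (5.0, best, 80.0):
        print(f"interval={t:6.1f}  overhead={overhead_fraction(t, cost, mtbf):.3f}")
```

Under these assumptions, intervals well below or well above the approximate optimum both yield higher overhead, which mirrors the qualitative argument above. The approximation does not account for deadlines or approximate computations, which is one reason a simulation study is needed for the workload considered here.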
