Shadows on the Cloud: An Energy-aware, Profit Maximizing Resilience Framework for Cloud Computing

As the demand for cloud computing continues to grow, cloud service providers face the daunting challenge of meeting negotiated SLAs, in terms of reliability and timely performance, while remaining cost-effective. This challenge is compounded by the increasing likelihood of failure in large-scale clouds and the rising cost of energy consumption. This paper proposes Shadow Replication, a novel profit-maximizing resiliency model that addresses failure at scale while minimizing energy consumption. The basic tenet of the model is to associate with each main process a suite of shadow processes that execute concurrently with it, but initially at a much reduced execution speed, so that a shadow can take over when the main process fails. Two computationally feasible schemes are proposed to achieve shadow replication. A performance evaluation framework is developed to analyze these schemes and compare their performance to traditional replication-based fault tolerance methods, focusing on the inherent tradeoff between fault tolerance, the specified SLA, and profit maximization. The results show that Shadow Replication leads to significant energy reduction and is better suited for compute-intensive execution models, where up to 30% more profit can be achieved.
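The shadow idea described above can be illustrated with a minimal sketch. It is not the paper's model: the cubic power law (power proportional to speed cubed, a common dynamic voltage scaling approximation), the single-shadow setup, and all function and parameter names are assumptions made here for illustration only.

```python
def power(speed):
    # Assumed power model: dynamic power grows with the cube of speed.
    return speed ** 3


def run_with_shadow(work, main_speed, shadow_before, shadow_after,
                    failure_time=None):
    """Return (completion_time, energy) for one main process and one shadow.

    The shadow runs at shadow_before while the main process is alive and
    switches to shadow_after if the main process fails at failure_time.
    All names are hypothetical; the abstract gives no concrete interface.
    """
    main_finish = work / main_speed
    if failure_time is None or failure_time >= main_finish:
        # No failure before completion: the shadow is stopped when main ends.
        energy = (power(main_speed) + power(shadow_before)) * main_finish
        return main_finish, energy

    # Main fails: the shadow has completed part of the work at low speed
    # and finishes the remainder at its higher post-failure speed.
    done_by_shadow = shadow_before * failure_time
    remaining = work - done_by_shadow
    finish = failure_time + remaining / shadow_after
    energy = ((power(main_speed) + power(shadow_before)) * failure_time
              + power(shadow_after) * (finish - failure_time))
    return finish, energy


if __name__ == "__main__":
    # Shadow at half speed versus a full-speed replica, without failure.
    print(run_with_shadow(100, 1.0, 0.5, 1.0))
    print(run_with_shadow(100, 1.0, 1.0, 1.0))
    # Main process fails at t = 40; the shadow speeds up and takes over.
    print(run_with_shadow(100, 1.0, 0.5, 1.0, failure_time=40))
```

Under these assumed numbers, the half-speed shadow costs 112.5 energy units in the failure-free case against 200 for a full-speed replica, at the price of a later completion time (120 instead of 100) when the main process fails at t = 40. This is the kind of tradeoff between energy, SLA deadline, and profit that the abstract refers to.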
