Fault Tolerance in the Cloud

Cloud computing is an emerging platform that makes computing and storage available to end-users as services. The cloud is a “blob” of unstructured resources classified into three domains: (a) applications (or software), (b) platform, and (c) infrastructure. Because it merges business and computing models, the cloud has become an important scientific and business medium for end-users. Cloud computing has been widely adopted in domains such as research, business, healthcare, e-commerce, agriculture, and social life, and it is increasingly employed in research areas such as smart grids, scientific applications, and nuclear science. Gartner's “Market Trends” report estimates that the market for cloud-based business services and Software-as-a-Service (SaaS) will grow from $13.4 billion to $32.2 billion between 2011 and 2016. Similarly, the market for Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) is estimated to grow from $7.6 billion to $35.5 billion over the same period. Cloud investments have delivered around $4 billion in benefit yield over the last five years.
