Paper Information - A Fault-Tolerant High Performance Cloud Strategy for Scientific Computing

Scientific computing often requires a massive number of computers for performing large-scale experiments. Traditionally, high-performance computing solutions and installed facilities such as clusters and supercomputers have been employed to address these needs. Cloud computing provides scientists with a completely new model for utilizing computing infrastructure, with the ability to perform parallel computations using large pools of virtual machines (VMs). The infrastructure services (Infrastructure-as-a-Service) provided by cloud vendors allow any user to provision a large number of compute instances. However, scientific computing is typically characterized by complex communication patterns and requires optimized runtimes. Today, VMs are manually instantiated, configured, and maintained by cloud users. These factors, coupled with latency, crash, and omission failures at service providers, result in inefficient use of VMs, increased complexity in VM-management tasks, a reduction in overall computation power, and increased task completion time. In this paper, a high-performance cloud computing strategy is proposed that combines the adaptation of a parallel processing framework, such as the Message Passing Interface (MPI), with an efficient checkpoint infrastructure for VMs, enabling its effective use for scientific computing. By developing such a mechanism, we can achieve optimized runtimes comparable to native clusters, improve checkpoints with low interference on task execution, and provide efficient task recovery. In addition, checkpointing is used to minimize the cost and volatility of resource provisioning while improving overall reliability. Analysis and simulations show that the proposed approach compares favorably with native cluster MPI implementations.
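To make the checkpoint-and-recover idea in the abstract concrete, the following is a minimal sketch of periodic checkpointing with restart in a single Python process. It is not the paper's VM-level mechanism and involves no MPI; it only illustrates the general pattern of saving state at intervals and resuming from the last saved state after a failure. The function names (`save_checkpoint`, `load_checkpoint`, `run_task`), the `fail_at` parameter, and the sum-of-squares workload are all hypothetical and chosen purely for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_task(path, n_steps, interval, fail_at=None):
    """Sum the squares 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` injects a simulated crash before that step (illustration only).
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"] ** 2
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a run that crashes partway through loses only the work done since the last checkpoint; calling `run_task` again with the same path resumes from the saved step rather than from zero. The checkpoint interval trades checkpoint overhead against the amount of work repeated after a failure.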

Ekpe Okorafor | E. Okorafor
