Scientific Workflows in the Cloud

The development of cloud computing has generated significant interest in the scientific computing community. In this chapter we consider the impact of cloud computing on scientific workflow applications. We examine the benefits and drawbacks of cloud computing for workflows, and argue that the primary benefit of cloud computing is not the economic model it promotes, but rather the technologies it employs and how they enable new features for workflow applications. We describe how clouds can be configured to execute workflow tasks and present a case study that examines the performance and cost of three typical workflow applications on Amazon EC2. Finally, we identify several areas in which existing clouds can be improved and discuss the future of workflows in the cloud.
