Enabling scalable scientific workflow management in the Cloud

Cloud computing is gaining tremendous momentum in both academia and industry. In this context, we define the term "Cloud Workflow" as the specification, execution and provenance tracking of large-scale scientific workflows, together with the management of the data and computing resources that support their execution in the Cloud. In this paper, we first analyze the gap between Cloud computing and scientific workflow management, two complementary technologies, and what it means to bring Clouds and workflows together. We then present the key challenges in supporting Cloud workflows and our reference framework for scientific workflow management in the Cloud. Finally, we present our experience in integrating a scientific workflow management system, Swift, into the Cloud. We discuss the performance of cluster provisioning on the OpenNebula Cloud platform, the Eucalyptus Cloud platform and Amazon EC2, and we demonstrate the capability and efficiency of the integration using a NASA MODIS image processing workflow and the Montage image mosaic workflow.

Note to practitioners. Scientific workflow management plays a very important role in scientific computing and application coordination, while Cloud computing offers scalability and on-demand resources. We devise autonomous methods to integrate scientific workflow management systems with Cloud platforms and to provision resources for large-scale workflows, which help scientists manage their workflows in the Cloud and take advantage of large-scale Cloud resources. There are a few integration options and many challenges in the process, and the experience we gain will help researchers migrate their workflow management systems and workflow applications into the Cloud.
We analyze the major challenges of running scientific workflows on the Cloud. We propose a reference framework to standardize the integration. The implementation experience shows that the framework is feasible and extensible. A cluster-recycling mechanism can improve resource provisioning efficiency.
