Parallelization of Scientific Workflows in the Cloud

Nowadays, more and more scientific experiments need to handle massive amounts of data. Their data processing consists of multiple computational steps and the dependencies among them. A data-intensive scientific workflow is an appropriate tool for modeling such a process. Since the execution of data-intensive scientific workflows requires large-scale computing and storage resources, a cloud environment, which provides virtually infinite resources, is appealing. However, because the scientific groups collaborating in the experiments are generally geographically distributed, multisite management of data-intensive scientific workflows in the cloud is becoming an important problem. This paper presents a general study of the current state of the art of data-intensive scientific workflow execution in the cloud and the corresponding multisite management techniques.
