Scientific workflow design for mere mortals

Recent years have seen a dramatic increase in research and development of scientific workflow systems. These systems promise to make scientists more productive by automating data-driven and compute-intensive analyses. Despite many early achievements, the long-term success of scientific workflow technology critically depends on making these systems usable by "mere mortals", i.e., scientists who have a very good idea of the analysis methods they wish to assemble, but who are neither software developers nor scripting-language experts. With these users in mind, we identify a set of desiderata for scientific workflow systems that are crucial for enabling scientists to model and design, by themselves, the workflows they wish to automate. As a first step towards meeting these requirements, we also show how the collection-oriented modeling and design (COMAD) approach for scientific workflows, implemented within the Kepler system, can help provide these critical, design-oriented capabilities to scientists.
