Scientific workflow management and the Kepler system

Many scientific disciplines are now data- and information-driven, and new scientific knowledge is often gained by scientists assembling data analysis and knowledge discovery ‘pipelines’. A related trend is that more and more scientific communities recognize the benefits of sharing their data and computational services, and are thus contributing to a distributed data and computational community infrastructure (a.k.a. ‘the Grid’). However, this infrastructure is only a means to an end, and ideally scientists should not have to be too concerned with its existence. The goal is for scientists to focus on developing and using what we call scientific workflows: networks of analytical steps that may involve, e.g., database access and querying, data analysis and mining, and many other steps, including computationally intensive jobs on high-performance cluster computers. In this paper we describe the characteristics of, and requirements for, scientific workflows as identified in a number of our application projects. We then elaborate on Kepler, a particular scientific workflow system currently under development across a number of scientific data management projects. We describe some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research. Kepler is a community-driven, open source project, and we welcome related projects and new contributors. Copyright © 2005 John Wiley & Sons, Ltd.
