Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data

Steps in scientific workflows often generate collections of results, causing the data flowing through workflows to become increasingly nested. Because conventional workflow components (or actors) typically operate on simple or application-specific data types, additional actors are often required to manage these nested data collections. As a result, conventional workflows grow more complex as data becomes more deeply nested. This paper describes a new paradigm for developing scientific workflows that transparently manages nested data collections. Collection-oriented workflows have a number of advantages over conventional approaches, including simpler workflow designs (e.g., requiring fewer actors and control-flow constructs) that are invariant under changes in data nesting. Our implementation within the Kepler scientific workflow system enables the explicit representation of collections and collection schemas, supports concurrent operation over collection contents via multi-level pipeline parallelism, and allows collection-aware actors to be composed readily from conventional actors.
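To make the idea of composing a collection-aware actor from a conventional one more concrete, the following is a minimal sketch in Python rather than the Kepler implementation described here. It assumes nested collections can be modeled as plain Python lists; the names `collection_aware`, `gc_content`, and `gc_actor` are hypothetical and chosen only for illustration. The sketch shows invariance under changes in data nesting only; it does not attempt pipeline parallelism or collection schemas.

```python
from typing import Any, Callable, Union

# Assumption for illustration: a nested collection is a Python list whose
# items are either data items or further lists (sub-collections).
Nested = Union[Any, list]


def collection_aware(actor: Callable[[Any], Any]) -> Callable[[Nested], Nested]:
    """Lift a conventional, item-level actor so it accepts nested collections."""

    def lifted(data: Nested) -> Nested:
        if isinstance(data, list):
            # Recurse into sub-collections, preserving the nesting structure.
            return [lifted(item) for item in data]
        # Leaf data item: apply the conventional actor unchanged.
        return actor(data)

    return lifted


def gc_content(seq: str) -> float:
    """Conventional actor: fraction of G/C bases in one non-empty DNA sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical collection-aware actor built from the conventional one.
gc_actor = collection_aware(gc_content)

flat = ["GGCC", "ATAT"]
nested = [["GGCC", "ATAT"], [["GCAT"]]]

print(gc_actor(flat))    # [1.0, 0.0]
print(gc_actor(nested))  # [[1.0, 0.0], [[0.5]]]
```

The same `gc_actor` handles both the flat and the deeply nested input without any extra control-flow actors, which is the sense in which such a design stays unchanged when the nesting of the data changes.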
