Scientific Workflows and Provenance: Introduction and Research Opportunities

Scientific workflows are becoming increasingly popular for compute-intensive and data-intensive scientific applications. The vision and promise of scientific workflows include rapid, easy workflow design, reuse, scalable execution, and other advantages, such as facilitating “reproducible science” through provenance (e.g., data lineage) support. However, as described in the paper, important research challenges remain. While the database community has studied (business) workflow technologies extensively in the past, most current work on scientific workflows seems to be done outside the database community, e.g., by practitioners and researchers in the computational sciences and eScience. We provide a brief introduction to scientific workflows and provenance, and identify areas and problems that suggest new opportunities for database research.
