Search, adapt, and reuse: the future of scientific workflows

Over the past several years, a number of scientific workflow management systems (SciWFM) have reached a level of maturity that should permit their use in production-style environments. This is especially true for the life sciences, but SciWFM also attract considerable attention in fields such as geophysics and climate research. These developments, accompanied by the growing availability of analytical tools wrapped as (web) services, were driven by a series of attractive promises: end users will be empowered to develop their own pipelines; reuse of services will be enhanced by easier integration into custom workflows; the time needed to develop analysis pipelines will decrease; and so on. Despite all these efforts, SciWFM have not yet found widespread acceptance among their intended audience. In this paper, we argue that wider adoption of SciWFM will only be achieved if the focus of research and development shifts from methods for developing and running workflows to searching, adapting, and reusing existing workflows. Only through this shift can SciWFM reach the mass of domain scientists who actually perform scientific analyses and have little interest in developing workflows themselves. To this end, SciWFM need to be combined with community-wide workflow repositories that allow users to find solutions to their scientific needs (coded as workflows). In this vision paper, we show how and where such developments have already started and highlight the new research questions that arise.
