Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support

Background: Over the past decade, the workflow system paradigm has evolved as an efficient and user-friendly approach for developing complex bioinformatics applications. Two popular workflow systems that have gained acceptance in the bioinformatics community are Taverna and Galaxy. Each system has a large user base and supports an ever-growing repository of application workflows. However, workflows developed for one system cannot be imported and executed easily on the other. This lack of interoperability stems from differences in the models of computation, workflow languages, and architectures of the two systems. It limits the sharing of workflows between the user communities and leads to duplication of development effort.

Results: In this paper, we present Tavaxy, a stand-alone system for creating and executing workflows based on an extensible set of re-usable workflow patterns. Tavaxy offers a set of new features that simplify and enhance the development of sequence analysis applications: it allows the integration of existing Taverna and Galaxy workflows in a single environment, and it supports the use of cloud computing capabilities. The integration of existing Taverna and Galaxy workflows is supported seamlessly at both run time and design time, based on the concepts of hierarchical workflows and workflow patterns. The use of cloud computing in Tavaxy is flexible: users can either instantiate the whole system on the cloud or delegate the execution of certain sub-workflows to the cloud infrastructure.

Conclusions: Tavaxy shortens the workflow development cycle by using workflow patterns to simplify workflow creation. It enables the re-use and integration of existing (sub-)workflows from Taverna and Galaxy, and allows the creation of hybrid workflows. Its additional features exploit recent advances in high-performance cloud computing to cope with the increasing data size and complexity of analysis. The system can be accessed through a cloud-enabled web interface or downloaded and installed to run within the user's local environment. All resources related to Tavaxy are available at http://www.tavaxy.org.
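The idea of hierarchical workflows with optional delegation can be illustrated with a minimal Python sketch. It is not Tavaxy's implementation: the class names `Task` and `SubWorkflow`, the `origin` and `delegate` fields, and the `fake_cloud` executor are all hypothetical, chosen only to show how a hybrid workflow might nest sub-workflows of different origin and hand selected ones to a remote executor.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class Task:
    """A single processing step (hypothetical; not a Tavaxy class)."""
    name: str
    func: Callable[[str], str]

    def run(self, data: str) -> str:
        return self.func(data)


@dataclass
class SubWorkflow:
    """A workflow that may itself contain nested sub-workflows."""
    name: str
    steps: List[Union[Task, "SubWorkflow"]]
    origin: str = "native"    # assumed label, e.g. "taverna" or "galaxy"
    delegate: bool = False    # assumed flag: run on a remote executor

    def run(self, data: str, remote: Optional[Callable] = None) -> str:
        # Delegate the whole sub-workflow if requested and an executor exists.
        if self.delegate and remote is not None:
            return remote(self, data)
        return self.run_local(data, remote)

    def run_local(self, data: str, remote: Optional[Callable] = None) -> str:
        # Execute steps in sequence, descending into nested sub-workflows.
        for step in self.steps:
            if isinstance(step, SubWorkflow):
                data = step.run(data, remote)
            else:
                data = step.run(data)
        return data


delegated: List[str] = []


def fake_cloud(sub: SubWorkflow, data: str) -> str:
    """Stand-in for a cloud executor: records the call, then runs locally."""
    delegated.append(sub.name)
    return sub.run_local(data)


# A hybrid workflow combining two sub-workflows of different (assumed) origin.
galaxy_part = SubWorkflow("clean", [Task("upper", str.upper)], origin="galaxy")
taverna_part = SubWorkflow(
    "align", [Task("reverse", lambda s: s[::-1])],
    origin="taverna", delegate=True,
)
hybrid = SubWorkflow("hybrid", [galaxy_part, taverna_part])

result = hybrid.run("acgt", remote=fake_cloud)
```

In this sketch only the sub-workflow marked `delegate=True` reaches the stand-in executor, while the rest of the hierarchy runs in place. That mirrors, in a very simplified form, the option of delegating certain sub-workflows to cloud infrastructure.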
