Life science data analysis workflow development using the BioExtract Server leveraging the iPlant Collaborative cyberinfrastructure

To handle the vast quantities of biological data generated by high-throughput experimental technologies, the BioExtract Server (bioextract.org) has leveraged iPlant Collaborative (www.iplantcollaborative.org) functionality to help address big data storage and analysis issues in the bioinformatics field. The BioExtract Server is a Web-based, workflow-enabling system that offers researchers a flexible environment for analyzing genomic data. It lets researchers save a series of BioExtract Server tasks (e.g., query a data source, save a data extract, and execute an analytic tool) as a workflow and share their data extracts, analytic tools, and workflows with collaborators. The iPlant Collaborative is a community of researchers, educators, and students working to enrich science through the development of cyberinfrastructure: the physical computing resources, collaborative environment, virtual machine resources, and interoperable analysis software and data services that are essential components of modern biology. The iPlant AGAVE Advanced Programming Interface, developed through the iPlant Collaborative, is a hosted, Software-as-a-Service resource providing access to a collection of high performance computing and cloud resources. Leveraging AGAVE, the BioExtract Server gives researchers easy access to multiple high performance computers and delivers computation and storage as dynamically allocated resources via the Internet. © 2014 The Authors. Concurrency and Computation: Practice and Experience published by John Wiley & Sons Ltd.
