Building and provisioning bioinformatics environments on public and private Clouds

Unlike newly developed web applications, which can be designed from the ground up to use cloud APIs and run natively within cloud infrastructure, most complex bioinformatics pipelines in advanced states of development can only be encapsulated within virtual machines (VMs), along with all their software and data dependencies. To take advantage of the scalability offered by the cloud, additional frameworks are required to create virtualized compute clusters that emulate the infrastructure most commonly found on institutional resources, where most existing bioinformatics pipelines are run. In this paper we describe one such framework and its compatibility with multiple clouds, and we present an automated process for deploying the entire system so that it can easily be made available on any cloud.
