Enabling Data and Compute Intensive Workflows in Bioinformatics

Accelerated growth in the field of bioinformatics has resulted in the production and analysis of large data sets. With this rapid growth has come the need to analyze these data in a quick, easy, scalable, and reliable manner on a variety of computing infrastructures, including desktops, clusters, grids, and clouds. This paper presents the application of workflow technologies, specifically Pegasus WMS, a robust scientific workflow management system, to a variety of bioinformatics projects spanning RNA sequencing, proteomics, and data quality control in population studies using GWAS data.
