Modelling Computational Resources for Next Generation Sequencing Bioinformatics Analysis of 16S rRNA Samples

In the rapidly evolving domain of next generation sequencing (NGS) and bioinformatics analysis, the volume of data generated is increasing at a comparable rate. The burden of processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software package (QIIME) was developed using key variables characteristic of the input data that were assumed to impact processing time. Multiple linear regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models accurately predicted clock time for denoising sequences from a naturally assembled community dataset, but not from an artificial community. Speedup and efficiency tests for AmpliconNoise also highlighted that caution is needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can assist NGS analysts in determining resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to encompass other packages running on multiple architectures, either in parallel or sequentially.
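The modelling approach described above can be illustrated with a minimal sketch. It fits a multiple linear regression of run time on input-data predictors by ordinary least squares, then computes parallel speedup and efficiency using the standard definitions (serial time divided by parallel time, and speedup divided by processor count). The function names, the two predictors (number of reads and mean read length) and the numbers are all hypothetical placeholders chosen for illustration; they are not the variables, data or results of the study.

```python
import numpy as np


def fit_runtime_model(predictors, runtimes):
    """Fit runtime = b0 + b1*x1 + ... + bk*xk by ordinary least squares."""
    predictors = np.atleast_2d(np.asarray(predictors, dtype=float))
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(predictors)), predictors])
    coef, *_ = np.linalg.lstsq(design, np.asarray(runtimes, dtype=float), rcond=None)
    return coef


def predict_runtime(coef, predictors):
    """Predict run time for one or more rows of predictor values."""
    predictors = np.atleast_2d(np.asarray(predictors, dtype=float))
    design = np.column_stack([np.ones(len(predictors)), predictors])
    return design @ coef


def speedup(t_serial, t_parallel):
    """Speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency: speedup divided by the number of processors used."""
    return speedup(t_serial, t_parallel) / n_procs


# Synthetic, noise-free example data (assumed, NOT taken from the paper).
# Columns: number of reads, mean read length (both hypothetical predictors).
sample_predictors = np.array([
    [1000.0, 250.0],
    [5000.0, 300.0],
    [10000.0, 280.0],
    [20000.0, 320.0],
    [50000.0, 260.0],
])
# Run times generated from a made-up linear rule so the fit can be checked.
sample_runtimes = 5.0 + 0.002 * sample_predictors[:, 0] + 0.1 * sample_predictors[:, 1]

coef = fit_runtime_model(sample_predictors, sample_runtimes)
predicted = predict_runtime(coef, [30000.0, 270.0])
```

Because the synthetic run times are generated without noise, the fitted coefficients recover the made-up rule. Real measurements would carry residual error, so a fitted model should be validated on held-out datasets before being used to allocate resources. An efficiency well below 1 indicates that adding processors yields diminishing returns, which is the kind of caution the abstract raises for parallel runs.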
