BOOTABLE: Bioinformatics benchmark tool suite for applications and hardware

Abstract Interest in analyzing biological data on a large scale has grown in recent years, and bioinformatics applications play an important role in analyzing these huge amounts of data. Because of the large data volumes and/or large problem spaces, considerable computing resources are required to answer the research questions raised. Estimating which underlying hardware is most suitable for the bioinformatics tools applied requires a well-defined benchmark suite. Such a suite is useful when purchasing hardware and, even more so, in larger projects that aim to establish a bioinformatics compute infrastructure. In this paper we present BOOTABLE, our bioinformatics benchmark suite. BOOTABLE currently contains seven popular and widely used bioinformatics applications that represent a broad spectrum of usage characteristics. It also includes an automated installation procedure and all required datasets, as well as functionality to test any desired application with regard to resource consumption and scaling behavior.
