Impact of concurrency on the performance of a whole exome sequencing pipeline

Background

Current high-throughput technologies, such as whole genome sequencing, RNA-Seq and ChIP-Seq, generate huge amounts of data, and their usage becomes more widespread every year. Complex analysis pipelines involving several computationally intensive steps have to be applied to an increasing number of samples. Workflow management systems allow parallelization and a more efficient use of computational power. Nevertheless, this mostly happens by assigning the available cores to the pipeline of a single sample, or of a few samples, at a time. We refer to this approach as the naive parallel strategy (NPS). Here, we discuss an alternative approach, which we refer to as the concurrent execution strategy (CES), which distributes the available processors equally across every sample's pipeline.

Results

Theoretically, we show that, under loose conditions, the CES results in a substantial speedup, with an ideal gain ranging from 1 to the number of samples. We also observe that the CES yields even faster executions because parallelizable tasks scale sub-linearly. Practically, we tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved latency speedups of up to 2–2.4 compared to the NPS.

Conclusions

Our results suggest that if resource distribution is further tailored to specific situations, an even greater performance gain could be achieved when executing pipelines on multiple samples. For this to be feasible, the tools included in the pipeline would need to be benchmarked. In our opinion, these benchmarks should be performed consistently by the tools' developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making the use of low-power machine clusters feasible.
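The gain range quoted in the Results (from 1 to the number of samples) can be illustrated with a small sketch. It is not the paper's derivation: it assumes an Amdahl's-law model in which every sample's pipeline has the same single-core runtime `t1` and the same parallelizable fraction `parallel_fraction`, and in which the cores divide evenly among the samples. All function and parameter names are hypothetical.

```python
def amdahl_time(t1, parallel_fraction, cores):
    """Runtime of one pipeline on `cores` cores under Amdahl's law (assumed model)."""
    serial = 1.0 - parallel_fraction
    return t1 * (serial + parallel_fraction / cores)


def nps_makespan(n_samples, total_cores, parallel_fraction, t1=1.0):
    """NPS: samples run one after another, each using all the cores."""
    return n_samples * amdahl_time(t1, parallel_fraction, total_cores)


def ces_makespan(n_samples, total_cores, parallel_fraction, t1=1.0):
    """CES: all samples run at once, each with an equal share of the cores."""
    if total_cores < n_samples:
        raise ValueError("this sketch assumes at least one core per sample")
    cores_per_sample = total_cores / n_samples
    return amdahl_time(t1, parallel_fraction, cores_per_sample)


def ces_gain(n_samples, total_cores, parallel_fraction, t1=1.0):
    """Ratio of NPS makespan to CES makespan (values above 1 favour the CES)."""
    nps = nps_makespan(n_samples, total_cores, parallel_fraction, t1)
    ces = ces_makespan(n_samples, total_cores, parallel_fraction, t1)
    return nps / ces


if __name__ == "__main__":
    # Illustrative values only: 6 samples on a hypothetical 24-core machine.
    for fraction in (0.0, 0.5, 0.9, 1.0):
        print(fraction, ces_gain(6, 24, fraction))
```

Under these assumptions the gain equals the number of samples when the pipeline is fully serial (`parallel_fraction = 0`) and falls to 1 when it is perfectly parallel (`parallel_fraction = 1`). The model scales the parallel part linearly with the cores, so it does not capture the extra benefit the abstract attributes to sub-linear scaling.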
