Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Throughput from sequencing instruments has been increasing at an unprecedented rate, leading to an explosion of next-generation sequencing (NGS) data and to challenges in storing, managing, and analyzing these datasets. Parallelism is the key to handling large-scale data, and some progress has been made in parallelizing important steps such as sequence alignment. However, other major steps remain sequential, limiting the ability to handle massive datasets. In this paper, we focus on parallelizing algorithms from two areas. The first is efficient conversion among a wide variety of sequence data formats, which is important for cross-utilization of different analysis modules. The second is statistical analysis. Our parallel sequence data format converter allows sequence datasets in SAM/BAM format to be converted into multiple formats, including SAM/BAM, BED, FASTA, FASTQ, BEDGRAPH, JSON, and YAML, using both shared-memory and distributed-memory parallelism. The converter currently comprises three instances: a SAM format converter, a BAM format converter, and a preprocessing-optimized SAM format converter. The converter also supports partial format conversion, which performs format conversion only on a specified chromosome region. The statistical analysis module includes a parallelized non-local means (NL-means) algorithm and false discovery rate (FDR) computation. Through extensive evaluation, we demonstrate the high scalability of our framework.
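To make the idea of partial format conversion concrete, the following is a minimal sketch, not the converter described in this paper. It converts SAM alignment lines to BED6 records in independent chunks and keeps only alignments that overlap a requested chromosome region. The function names (`sam_line_to_bed`, `convert_chunk`, `parallel_sam_to_bed`), the chunk size, the use of a thread pool as a stand-in for shared-memory parallelism, and the choice of MAPQ as the BED score are all assumptions made for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# CIGAR operations that consume reference bases (per the SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def sam_line_to_bed(line, region=None):
    """Convert one SAM alignment line to a BED6 line.

    Returns None for header lines, unmapped reads, and reads that fall
    outside the optional region (chrom, start, end), given in 0-based,
    half-open BED coordinates.
    """
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None
    qname, flag, rname = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped or no alignment information
    start = pos - 1  # SAM POS is 1-based, BED start is 0-based
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    end = start + ref_len
    if region is not None:
        chrom, lo, hi = region
        if rname != chrom or end <= lo or start >= hi:
            return None
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def convert_chunk(lines, region=None):
    """Convert a list of SAM lines, dropping skipped records."""
    converted = (sam_line_to_bed(line, region) for line in lines)
    return [bed for bed in converted if bed is not None]


def parallel_sam_to_bed(lines, region=None, workers=4, chunk_size=1000):
    """Split the input into chunks and convert them concurrently.

    Each chunk is independent, so the same decomposition could be mapped
    onto processes or cluster nodes; threads are used here only to keep
    the sketch short. Output order matches input order.
    """
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: convert_chunk(c, region), chunks))
    return [bed for chunk in results for bed in chunk]


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
        "r2\t16\tchr1\t200\t30\t5M2D5M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    ]
    for record in parallel_sam_to_bed(sam, region=("chr1", 150, 250)):
        print(record)
```

Because every SAM line is converted without reference to its neighbors, the work divides cleanly by chunk. A binary BAM input would additionally need block-aware splitting, which this sketch does not attempt.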
