Paper Information - PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis

Background: High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. The technology generates very large volumes of data, and analysing them in a cost- and time-effective way is a major challenge. The different types of NGS data also share certain demanding analysis steps. Spliced alignment is one such fundamental step, and it is both computationally intensive and time-consuming. Even the most widely used spliced alignment tools have serious limitations. TopHat is one such widely used tool: although it supports multithreading, it does not use CPU and memory efficiently. Here we introduce PVT (Pipelined Version of TopHat), which takes a modular approach and breaks TopHat's serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and the utilization of computational resources. PVT thus addresses these shortcomings of TopHat so that large NGS datasets can be analysed efficiently.

Results: We analysed the SRA datasets SRX026839 and SRX026838, consisting of single-end reads, and the SRA dataset SRR1027730, consisting of paired-end reads. We ran TopHat v2.0.8 on these datasets and recorded CPU usage, memory footprint and execution time during spliced alignment. Based on these measurements, we designed PVT, a pipelined version of TopHat that removes redundant computational steps during spliced alignment and breaks the job into a pipeline of multiple stages (each comprising one or more steps) to improve resource utilization, thus reducing execution time.

Conclusions: PVT improves on TopHat for spliced alignment in NGS data analysis. It reduced the execution time to ~23% for the single-end read dataset. The PVT variant designed for paired-end reads showed an improvement of ~41% over TopHat in execution time (for the chosen data). We also propose PVT-Cloud, which implements the PVT pipeline in a cloud computing system.
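The general idea of turning a serial job into a pipeline of stages can be illustrated with a minimal Python sketch. This is not PVT's implementation: the function `run_pipeline` and the three stage functions are hypothetical placeholders and do not correspond to TopHat's actual steps. The sketch only shows how stages connected by queues can overlap in time, so that one stage processes chunk i+1 while the next stage is still working on chunk i.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def run_pipeline(items, stages):
    """Run each stage in its own thread, connected by FIFO queues.

    Hypothetical helper for illustration only. Output order matches
    input order because each stage is a single worker reading a FIFO.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # propagate shutdown to the next stage
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed the first queue, then signal end of input.
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    # Drain the last queue until the sentinel arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for t in threads:
        t.join()
    return results


# Placeholder stages (assumptions, NOT TopHat's real steps).
def split_reads(chunk):
    return chunk.split()


def measure(reads):
    return [(read, len(read)) for read in reads]


def summarize(pairs):
    return sum(length for _, length in pairs)


if __name__ == "__main__":
    chunks = ["ACGT AC", "GGG"]
    print(run_pipeline(chunks, [split_reads, measure, summarize]))
```

In this sketch the stages are plain Python functions running in threads, which is enough to show the structure. A pipeline around external alignment programs would more likely launch separate processes per stage, and would need error handling that this example omits.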

Subhasis Dasgupta | Sunirmal Khatua | Ranjan Kumar Maji | Arijita Sarkar | Zhumur Ghosh
