Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM)

MOTIVATION A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not been performed previously. RESULTS We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription-polymerase chain reaction (RT-PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability. AVAILABILITY The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (http://cbil.upenn.edu/RUM). CONTACT ggrant@pcbi.upenn.edu; epierce@mail.med.upenn.edu SUPPLEMENTARY INFORMATION The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.

[1]  D. Carrasco,et al.  BCL9 promotes tumor progression by conferring enhanced proliferative, metastatic, and angiogenic properties to cancer cells. , 2009, Cancer research.

[2]  Martin M. Matzuk,et al.  MLL2 Is Required in Oocytes for Bulk Histone 3 Lysine 4 Trimethylation and Transcriptional Silencing , 2010, PLoS biology.

[3]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[4]  Y. Xing,et al.  Detection of splice junctions from paired-end RNA-seq data by SpliceMap , 2010, Nucleic acids research.

[5]  S. Gabriel,et al.  Advances in understanding cancer genomes through second-generation sequencing , 2010, Nature Reviews Genetics.

[6]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[7]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[8]  Kang Zhang,et al.  A splice-site mutation in a retina-specific exon of BBS8 causes nonsyndromic retinitis pigmentosa. , 2010, American journal of human genetics.

[9]  Tanya M. Teslovich,et al.  Basal body dysfunction is a likely cause of pleiotropic Bardet–Biedl syndrome , 2003, Nature.

[10]  S. Nelson,et al.  BFAST: An Alignment Tool for Large Scale Genome Resequencing , 2009, PloS one.

[11]  Emily H Turner,et al.  Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome , 2010, Nature Genetics.

[12]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[13]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[14]  G. Sherlock,et al.  Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads , 2010, BMC Genomics.

[15]  M. Daly,et al.  A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms , 2001, Nature.

[16]  Thomas Werner,et al.  Next generation sequencing in functional genomics , 2010, Briefings Bioinform..

[17]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature biotechnology.

[18]  Nicholas Katsanis,et al.  The ciliopathies: an emerging class of human genetic disorders. , 2006, Annual review of genomics and human genetics.

[19]  Serban Nacu,et al.  Fast and SNP-tolerant detection of complex variants and splicing in short reads , 2010, Bioinform..

[20]  김삼묘,et al.  “Bioinformatics” 특집을 내면서 , 2000 .

[21]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[22]  Ruedi Aebersold,et al.  Activator-mediated recruitment of the MLL2 methyltransferase complex to the beta-globin locus. , 2007, Molecular cell.

[23]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[24]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[25]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[26]  Alan R. Earls,et al.  Digital equipment corporation. , 2004, Analytical chemistry.

[27]  Y. Fukushima,et al.  Kabuki make-up syndrome: a syndrome of mental retardation, unusual facies, large and protruding ears, and postnatal growth deficiency. , 1981, The Journal of pediatrics.

[28]  J. Derisi,et al.  HMMSplicer: A Tool for Efficient and Sensitive Discovery of Known and Novel Splice Junctions in RNA-Seq Data , 2010, PloS one.

[29]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[30]  Brian E. Howard,et al.  Towards reliable isoform quantification using RNA-SEQ data , 2009, 2009 IEEE International Conference on Bioinformatics and Biomedicine.