OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds

A crucial step in analyzing mRNA-Seq data is to accurately and efficiently map hundreds of millions of reads to the reference genome and exon junctions. Here we present OLego, an algorithm specifically designed for de novo mapping of spliced mRNA-Seq reads. OLego adopts a multiple-seed-and-extend scheme and does not rely on a separate external aligner. It achieves high sensitivity in junction detection through strategic searches with small seeds (∼14 nt for mammalian genomes). To improve accuracy and resolve ambiguous mappings at junctions, OLego uses a built-in statistical model that scores exon junctions by splice-site strength and intron size. The Burrows–Wheeler transform is used in multiple steps of the algorithm to efficiently map seeds, locate junctions and identify small exons. OLego is implemented in C++ with fully multithreaded execution, which allows fast processing of large-scale data. We systematically evaluated the performance of OLego against published tools using both simulated and real data. OLego demonstrated better sensitivity, higher or comparable accuracy and substantially improved speed. OLego also identified hundreds of novel micro-exons (<30 nt) in the mouse transcriptome, many of which are phylogenetically conserved and can be validated experimentally in vivo. OLego is freely available at http://zhanglab.c2b2.columbia.edu/index.php/OLego.
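
To make the idea of mapping small seeds with the Burrows–Wheeler transform concrete, the following is a minimal sketch of exact seed lookup by FM-index backward search. It is a generic textbook illustration, not OLego's C++ implementation: the function names (`build_fm_index`, `find_seed`, `split_into_seeds`), the naive suffix-array construction, the toy genome, and the choice of non-overlapping seeds are all assumptions made for the example.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for `text`."""
    text += "$"  # sentinel, lexicographically smaller than A/C/G/T
    # Naive suffix array: acceptable for a toy string, not for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, c_table, occ


def find_seed(seed, index):
    """Return sorted 0-based positions of exact matches of `seed`."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(seed):  # backward search: last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def split_into_seeds(read, k):
    """Cut a read into non-overlapping k-mers (hypothetical seeding rule)."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]


if __name__ == "__main__":
    genome = "ACGTACGTTAGC"  # toy reference
    index = build_fm_index(genome)
    for seed in split_into_seeds("ACGTTAGC", 4):
        print(seed, find_seed(seed, index))
```

In a spliced aligner, seeds from the same read that hit distant genomic positions would hint at an intron between them; how OLego selects seeds, extends them, and scores the resulting junctions is not specified in this section and is not modelled here.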
