TrueSight: a new algorithm for splice junction detection using RNA-seq

RNA-seq has proven to be a powerful technique for transcriptome profiling based on next-generation sequencing (NGS) technologies. However, because NGS reads are short, it is challenging to accurately map RNA-seq reads to splice junctions (SJs), a critically important step in the analysis of alternative splicing (AS) and isoform construction. In this article, we describe a new method, called TrueSight, which for the first time combines RNA-seq read mapping quality and the coding potential of genomic sequences into a unified model. The model is then used in a machine-learning approach to precisely identify SJs. Evaluations on both simulated and real data showed that TrueSight achieved higher sensitivity and specificity than other methods. We applied TrueSight to new high-coverage honey bee RNA-seq data to discover novel splice forms, and found that 60.3% of honey bee multi-exon genes are alternatively spliced. Using gene models improved by TrueSight, we characterized AS types in the honey bee transcriptome. We believe that TrueSight will be highly useful for comprehensive studies of the biology of alternative splicing.
