iMapSplice: Alleviating reference bias through personalized RNA-seq alignment

Genomic variants in both coding and non-coding sequences can have functionally important and sometimes deleterious effects on exon splicing of gene transcripts. For transcriptome profiling using RNA-seq, accurate alignment of reads across exon junctions is a critical step. Existing algorithms that use a standard reference genome as a template can have difficulty mapping reads that carry genomic variants. These mapping failures can lead to allelic ratio biases and to missed splice variants created by splice site polymorphisms. To improve RNA-seq read alignment, we have developed iMapSplice, a novel approach that enables personalized mRNA transcriptome profiling. The algorithm makes use of personal genomic information and performs an unbiased alignment against genome indices carrying both reference and alternative bases. Importantly, this breaks the dependency on reference genome splice site dinucleotide motifs and enables iMapSplice to discover personal splice junctions created through splice site polymorphisms. We report comparative analyses on a number of simulated and real datasets. Besides general improvements in read alignment and splice junction discovery, iMapSplice greatly alleviates allelic ratio biases and uncovers many previously uncharacterized splice junctions created by splice site polymorphisms, with minimal overhead in computation time and storage. Software download URL: https://github.com/LiuBioinfo/iMapSplice.
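To make the splice site polymorphism idea concrete, the following is a minimal Python sketch, not the iMapSplice implementation. It shows how a single SNP can create a GT-AG intron motif that is absent from the reference sequence, and how a reference allele ratio can be computed from read counts. The toy sequence, the function names, and the simplification of treating only GT-AG as a canonical motif are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Simplifying assumption: only the GT-AG donor/acceptor pair counts as canonical.
CANONICAL_MOTIF = ("GT", "AG")


def apply_snps(reference: str, snps: Dict[int, str]) -> str:
    """Return a personal sequence with alternative bases at 0-based positions."""
    bases = list(reference)
    for position, alt_base in snps.items():
        bases[position] = alt_base
    return "".join(bases)


def splice_motif(seq: str, intron_start: int, intron_end: int) -> Tuple[str, str]:
    """Donor and acceptor dinucleotides of the intron [intron_start, intron_end)."""
    return seq[intron_start:intron_start + 2], seq[intron_end - 2:intron_end]


def is_canonical(seq: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron boundaries carry the GT-AG motif in this sequence."""
    return splice_motif(seq, intron_start, intron_end) == CANONICAL_MOTIF


def reference_allele_ratio(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a heterozygous site that carry the reference base."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


# Toy example (hypothetical sequence): exon, 10-base intron, exon.
reference = "CCCAAA" + "GATTTTTTAG" + "CCCC"
intron_start, intron_end = 6, 16

# A hypothetical A>T SNP at position 7 turns the donor site GA into GT.
personal = apply_snps(reference, {7: "T"})

print(splice_motif(reference, intron_start, intron_end))
print(splice_motif(personal, intron_start, intron_end))
print(is_canonical(reference, intron_start, intron_end))
print(is_canonical(personal, intron_start, intron_end))
```

In this toy case an aligner that checks motifs only against the reference would see a non-canonical GA-AG intron, whereas a check against the personal sequence sees GT-AG. For an unbiased aligner, `reference_allele_ratio` at a truly balanced heterozygous site would be expected to sit near 0.5; values consistently above that are one symptom of reference bias.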
