A new strategy to reduce allelic bias in RNA-Seq readmapping

Accurate estimation of expression levels from RNA-Seq data entails precise mapping of the sequence reads to a reference genome. Because the standard reference genome contains only one allele at any given locus, reads overlapping polymorphic loci that carry a non-reference allele are at least one mismatch away from the reference and, hence, are less likely to be mapped. This bias in read mapping leads to inaccurate estimates of allele-specific expression (ASE). To address this read-mapping bias, we propose the construction of an enhanced reference genome that includes the alternative alleles at known polymorphic loci. We show that mapping to this enhanced reference reduced the read-mapping biases, leading to more reliable estimates of ASE. Experiments on simulated data show that the proposed strategy reduced the number of loci with mapping bias by ≥63% when compared with a previous approach that relies on masking the polymorphic loci and by ≥18% when compared with the standard approach that uses an unaltered reference. When we applied our strategy to actual RNA-Seq data, we found that it mapped up to 15% more reads than the previous approaches and identified many seemingly incorrect inferences made by them.

[1]  Bert Vogelstein,et al.  Allelic Variation in Human Gene Expression , 2002, Science.

[2]  J. Knight,et al.  Allele-specific gene expression uncovered. , 2004, Trends in genetics : TIG.

[3]  M. Olivier A haplotype map of the human genome , 2003, Nature.

[4]  L. Kruglyak,et al.  Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. , 2005, Genome research.

[5]  M. Olivier A haplotype map of the human genome. , 2003, Nature.

[6]  Zhaohui S. Qin,et al.  A second generation human haplotype map of over 3.1 million SNPs , 2007, Nature.

[7]  Knut Reinert,et al.  SeqAn An efficient, generic C++ library for sequence analysis , 2008, BMC Bioinformatics.

[8]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[9]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[10]  Jessica Nordlund,et al.  Allele-specific gene expression patterns in primary leukemic cells reveal regulation of gene expression by CpG site methylation. , 2009, Genome research.

[11]  N. Warthmann,et al.  Simultaneous alignment of short reads against multiple genomes , 2009, Genome Biology.

[12]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[13]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[14]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[15]  John C. Marioni,et al.  Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data , 2009, Bioinform..

[16]  D. Clayton,et al.  Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing , 2009, Human molecular genetics.

[17]  Serban Nacu,et al.  Fast and SNP-tolerant detection of complex variants and splicing in short reads , 2010, Bioinform..

[18]  L. Coin,et al.  Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads , 2011, Genome Biology.

[19]  Schraga Schwartz,et al.  Detection and Removal of Biases in the Analysis of Next-Generation Sequencing Reads , 2011, PloS one.

[20]  M. Gerstein,et al.  AlleleSeq: analysis of allele-specific expression and binding in a network framework , 2011, Molecular systems biology.