AltHapAlignR: improved accuracy of RNA-seq analyses through the use of alternative haplotypes

Motivation: Reliance on mapping to a single reference haplotype currently limits accurate estimation of allele- or haplotype-specific expression from RNA-sequencing data, notably in highly polymorphic regions such as the major histocompatibility complex.

Results: We present AltHapAlignR, a method that incorporates alternate reference haplotypes to generate gene- and haplotype-level estimates of transcript abundance for any genomic region where such haplotypes are available. We validate the method on simulated and experimental data by quantifying input allelic ratios for major histocompatibility complex haplotypes, and demonstrate significantly improved correlation with ground truth estimates of gene counts compared to standard single reference mapping. We then apply AltHapAlignR to RNA-seq data from 462 individuals. Conventional mapping significantly underestimates expression of the majority of classical human leukocyte antigen genes; AltHapAlignR corrects this, allowing more accurate quantification of gene expression for individual alleles and haplotypes.

Availability and implementation: Source code is freely available at https://github.com/jknightlab/AltHapAlignR.

Supplementary information: Supplementary data are available at Bioinformatics online.
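To make the idea of haplotype-level counting and allelic ratios concrete, the following is a minimal illustrative sketch in Python. It is not the AltHapAlignR algorithm, which the abstract does not describe in detail. It assumes, purely for illustration, that each read has already been aligned to several reference haplotypes and that a mismatch count per haplotype is available. The function names, the tie-splitting rule, and the haplotype labels `pgf` and `cox` are all hypothetical choices for this example.

```python
from collections import defaultdict


def haplotype_counts(read_mismatches):
    """Tally reads per haplotype (illustrative assumption, not the published method).

    read_mismatches maps read id -> {haplotype label: mismatch count}.
    Each read goes to the haplotype with the fewest mismatches;
    ties are split equally among the tied haplotypes.
    """
    counts = defaultdict(float)
    for per_hap in read_mismatches.values():
        if not per_hap:
            continue  # read did not align to any haplotype
        best = min(per_hap.values())
        winners = [hap for hap, mm in per_hap.items() if mm == best]
        share = 1.0 / len(winners)
        for hap in winners:
            counts[hap] += share
    return dict(counts)


def allelic_ratio(counts, hap_a, hap_b):
    """Fraction of reads assigned to hap_a out of hap_a + hap_b (None if no reads)."""
    a = counts.get(hap_a, 0.0)
    b = counts.get(hap_b, 0.0)
    total = a + b
    return a / total if total else None


# Toy input: four reads scored against two hypothetical haplotype labels.
example = {
    "read1": {"pgf": 0, "cox": 2},
    "read2": {"pgf": 1, "cox": 1},  # tie, split 0.5 / 0.5
    "read3": {"pgf": 3, "cox": 0},
    "read4": {"pgf": 0, "cox": 1},
}
counts = haplotype_counts(example)
ratio = allelic_ratio(counts, "pgf", "cox")
print(counts, ratio)
```

With single reference mapping, a read such as `read3` in this toy input, which matches only the alternate haplotype well, could be lost or penalised. That loss is the kind of bias the abstract attributes to conventional mapping.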
