Assessing the impact of human genome annotation choice on RNA-seq expression estimates

Background: Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus.

Results: We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.

Conclusions: Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
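The stability metric named in the abstract, the average coefficient of variation (CV) over all genes, can be illustrated with a minimal sketch. The abstract does not specify how the CV was computed, so this assumes it is taken per gene across replicate expression estimates (sample standard deviation divided by the mean) and then averaged over genes with a positive mean. The function name `average_cv`, the gene labels, and all numeric values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def average_cv(expression):
    """Average per-gene coefficient of variation.

    `expression` maps a gene identifier to a list of expression
    estimates from replicate samples (assumed layout, not from the paper).
    Genes whose mean estimate is not positive are skipped, since the
    CV is undefined for them.
    """
    cvs = []
    for values in expression.values():
        gene_mean = mean(values)
        if gene_mean <= 0:
            continue
        # Sample standard deviation relative to the mean for this gene.
        cvs.append(stdev(values) / gene_mean)
    return mean(cvs)


# Hypothetical estimates for the same two genes under two annotations.
annotation_a = {"GENE1": [10.0, 10.0, 10.0], "GENE2": [5.0, 5.0, 5.0]}
annotation_b = {"GENE1": [10.0, 20.0, 30.0], "GENE2": [5.0, 5.0, 5.0]}

print(average_cv(annotation_a))
print(average_cv(annotation_b))
```

Under this reading, a lower average CV indicates more stable expression estimates, which is the sense in which the abstract compares annotations.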
