Gene ontology analysis for RNA-seq: accounting for selection bias

We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data because differential expression is over-detected for long and highly expressed transcripts. Applying GOseq to a prostate cancer data set dramatically changes the results, highlighting categories more consistent with the known biology.
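The selection bias described above can be illustrated with a toy calculation. The sketch below is not the GOseq implementation, which the abstract does not spell out. It is a minimal Python example under stated assumptions: the gene lengths, differential-expression (DE) calls, and category membership are invented, all function names are hypothetical, and the correction shown (resampling gene sets with probability weighted by the DE rate of each length group, using Efraimidis-Spirakis keys) is just one simple way to account for length when testing a category.

```python
import math
import random


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn uniformly from N genes, K of them in the category."""
    total = math.comb(N, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return tail / total


def de_rate_by_length(lengths, de):
    """Per-gene weight: the DE fraction among genes of the same length (toy length groups)."""
    totals, hits = {}, {}
    for length, is_de in zip(lengths, de):
        totals[length] = totals.get(length, 0) + 1
        hits[length] = hits.get(length, 0) + int(is_de)
    return [hits[length] / totals[length] for length in lengths]


def weighted_sample(weights, n, rng):
    """Draw n indices without replacement, favouring large weights.

    Uses Efraimidis-Spirakis keys u ** (1 / w); weights must be positive.
    """
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:n]]


def length_aware_pvalue(de, in_cat, weights, n_resamples, rng):
    """Resampling p-value for over-representation with a length-weighted null."""
    n_de = sum(de)
    observed = sum(1 for d, c in zip(de, in_cat) if d and c)
    hits = 0
    for _ in range(n_resamples):
        drawn = weighted_sample(weights, n_de, rng)
        if sum(in_cat[i] for i in drawn) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Invented toy data: 100 short genes and 100 long genes.
lengths = [500] * 100 + [5000] * 100
# DE calls depend only on length: 10% of short genes, 40% of long genes.
de = [i < 10 or 100 <= i < 140 for i in range(200)]
# A category of 30 long genes with the same 40% DE rate as other long genes,
# so it carries no signal beyond gene length.
in_cat = [100 <= i < 112 or 140 <= i < 158 for i in range(200)]

observed = sum(1 for d, c in zip(de, in_cat) if d and c)
p_standard = hypergeom_upper_tail(observed, len(de), sum(in_cat), sum(de))
weights = de_rate_by_length(lengths, de)
p_adjusted = length_aware_pvalue(de, in_cat, weights, 2000, random.Random(0))

print(f"standard hypergeometric p-value: {p_standard:.3f}")
print(f"length-aware resampling p-value: {p_adjusted:.3f}")
```

In this toy set-up the category looks enriched under the standard test only because its genes are long; once the null distribution respects the length-dependent DE rate, the apparent enrichment is no longer unusual.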
