bcGST - an interactive bias-correction method to identify over-represented gene-sets in boutique arrays

Motivation: Gene annotation and pathway databases such as Gene Ontology and the Kyoto Encyclopaedia of Genes and Genomes describe gene biological functions and associated pathways, and are important tools in Gene-Set Tests (GST). A GST aims to establish an association between a gene-set of interest and an annotation term by testing for over-representation of the gene-set's genes in that term. One implicit assumption of GST is that the gene expression platform captures the complete genome or a very large proportion of it. However, this assumption is satisfied neither by the increasingly popular boutique arrays nor by custom-designed gene expression profiling platforms. For these platforms, conventional GST is no longer appropriate because of the gene-set selection bias induced during their construction.

Results: We propose bcGST, a bias-corrected GST that introduces bias-correction terms into the contingency table used to calculate Fisher's Exact Test. The adjustment works by estimating the proportion of genes captured on the array with respect to the genome, which helps filter out annotation terms that would otherwise be falsely included or excluded. We illustrate the practicality and stability of bcGST through multiple differential gene expression analyses in melanoma and The Cancer Genome Atlas cancer studies.

Availability and implementation: The bcGST method is made available as a Shiny web application at http://shiny.maths.usyd.edu.au/bcGST/.

Supplementary information: Supplementary data are available at Bioinformatics online.
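To make the contingency-table idea concrete, the following is a minimal Python sketch of a conventional (uncorrected) over-representation test, together with a simple per-term estimate of how much of an annotation term is captured on an array. It is not the bcGST correction itself, whose exact terms are not given in this abstract. The function names `fisher_over_representation` and `capture_proportion`, and all gene identifiers and counts in the usage lines, are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_over_representation(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: number of genes in the background (e.g. all genes on the platform)
    K: number of background genes annotated to the term
    n: size of the gene-set of interest
    k: number of genes in the gene-set that are annotated to the term

    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= k <= min(n, K) and max(n, K) <= N):
        raise ValueError("inconsistent contingency-table counts")
    upper = min(n, K)
    # Sum the hypergeometric probabilities of overlaps at least as large as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def capture_proportion(term_genes, array_genes):
    """Fraction of a term's genes that are present on the array.

    Illustrative only: an assumed, naive estimate of platform coverage
    for a single annotation term, not the estimator used by bcGST.
    """
    term = set(term_genes)
    if not term:
        raise ValueError("annotation term has no genes")
    return len(term & set(array_genes)) / len(term)


# Hypothetical toy numbers: the same overlap tested against two backgrounds.
p_array = fisher_over_representation(k=8, n=20, K=30, N=200)      # boutique array
p_genome = fisher_over_representation(k=8, n=20, K=30, N=20000)   # whole genome
coverage = capture_proportion({"A", "B", "C", "D"}, {"A", "B", "X"})
```

Comparing `p_array` with `p_genome` shows how strongly the result depends on the assumed background, which is the assumption that breaks down on a boutique array. A low `coverage` for a term signals that the array sees only a small part of it.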
