Improved Performance of Gene Set Analysis on Genome-Wide Transcriptomics Data when Using Gene Activity State Estimates

Gene set analysis methods remain a popular and powerful way to evaluate genome-wide transcriptomics data. These approaches require an a priori grouping of genes into biologically meaningful sets, with downstream analyses then conducted at the set level rather than the gene level. Gene set analysis methods have been shown to yield more powerful statistical conclusions than single-gene analyses, both because the multiple testing penalty is reduced and because aggregating effects across the genes in a set can produce larger observed effects. Traditionally, gene set analysis methods have been applied directly to normalized, log-transformed transcriptomics data. Recently, efforts have been made to transform transcriptomics data to scales that yield more biologically interpretable results. For example, recently proposed models transform log-transformed transcriptomics data into a confidence metric (ranging between 0 and 100%) that a gene is active (roughly speaking, that the gene product is part of an active cellular mechanism). In this manuscript, we demonstrate, on both real and simulated transcriptomics data, that tests for differential expression between sets of genes are typically more powerful when using gene activity state estimates than when using log-transformed gene expression data. Our analysis suggests further exploration of techniques that transform transcriptomics data into meaningful quantities for improved downstream inference.
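To make the idea of an activity state estimate concrete, the following is a minimal sketch and not the model used in the manuscript. It assumes that a gene's log-expression values across samples come from a two-component Gaussian mixture (an "inactive" and an "active" component) and takes the posterior probability of the higher-mean component as the activity estimate. The function names `activity_probabilities`, `set_score`, and `welch_t` are hypothetical, and the Welch t statistic is only one of many possible set-level comparisons.

```python
import math
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def activity_probabilities(log_expr, n_iter=200):
    """Fit a two-component Gaussian mixture by EM (an assumed stand-in model).

    Returns, for each value in log_expr, the posterior probability that it
    belongs to the higher-mean ("active") component.
    """
    n = len(log_expr)
    xs = sorted(log_expr)
    half = n // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    mu = [statistics.fmean(xs[:half]), statistics.fmean(xs[half:])]
    sd0 = max(statistics.pstdev(xs), 1e-3)
    sd = [sd0, sd0]
    weight = [0.5, 0.5]
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for each observation.
        resp = []
        for x in log_expr:
            p0 = weight[0] * normal_pdf(x, mu[0], sd[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sd[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0.0 else 0.5)
        # M-step: update weights, means and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component has collapsed; keep the current estimates
        mu = [
            sum((1.0 - r) * x for r, x in zip(resp, log_expr)) / n0,
            sum(r * x for r, x in zip(resp, log_expr)) / n1,
        ]
        sd = [
            max(math.sqrt(sum((1.0 - r) * (x - mu[0]) ** 2
                              for r, x in zip(resp, log_expr)) / n0), 1e-3),
            max(math.sqrt(sum(r * (x - mu[1]) ** 2
                              for r, x in zip(resp, log_expr)) / n1), 1e-3),
        ]
        weight = [n0 / n, n1 / n]
    # Make sure the reported probability refers to the higher-mean component.
    if mu[0] > mu[1]:
        resp = [1.0 - r for r in resp]
    return resp


def set_score(values, gene_indices):
    """Aggregate per-gene values (one sample) over a gene set by their mean."""
    return statistics.fmean(values[i] for i in gene_indices)


def welch_t(group_a, group_b):
    """Welch t statistic comparing set-level scores between two conditions."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / se
```

Under these assumptions, a set-level comparison would compute `set_score` for each sample, once from log-expression values and once from activity probabilities, and then apply `welch_t` to the scores from the two conditions.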
