NaRnEA: An Information Theoretic Framework for Gene Set Analysis

Gene sets are increasingly used to draw high-level biological inferences from transcriptomic data; however, existing gene set analysis methods rely on overly conservative, heuristic approaches for quantifying the statistical significance of gene set enrichment. We created Nonparametric analytical-Rank-based Enrichment Analysis (NaRnEA) to facilitate accurate and robust gene set analysis with an optimal null model derived using the information theoretic Principle of Maximum Entropy. By measuring the differential activity of ~2500 transcriptional regulatory proteins based on the differential expression of each protein’s transcriptional targets between primary tumors and normal tissue samples in three cohorts from The Cancer Genome Atlas (TCGA), we demonstrate that NaRnEA substantially improves on two widely used gene set analysis methods: Gene Set Enrichment Analysis (GSEA) and analytical-Rank-based Enrichment Analysis (aREA). We show that the NaRnEA-inferred differential protein activity is significantly correlated with differential protein abundance inferred from independent, phenotype-matched mass spectrometry data in the Clinical Proteomic Tumor Analysis Consortium (CPTAC), confirming the statistical and biological accuracy of our approach. Additionally, our analysis demonstrates that the sample-shuffling empirical null models used by GSEA and aREA for gene set analysis are overly conservative, a shortcoming that is avoided by the newly developed Maximum Entropy analytical null model employed by NaRnEA.
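The contrast between an analytical null model and a sample-shuffling empirical null model can be illustrated with a minimal sketch. This is not NaRnEA's statistic and not its Maximum Entropy null model, neither of which is specified in this abstract; as a stand-in it uses a classical mean-rank statistic with a normal approximation, under the assumption that the gene set is a random draw of n genes from N. All function names and the toy data are hypothetical.

```python
import math
import random

def ranks(values):
    """Rank values from 1 (smallest) to N (largest); ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def signature(expr, labels):
    """Per-gene difference of mean expression: group 1 minus group 0."""
    sig = []
    for row in expr:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        sig.append(sum(g1) / len(g1) - sum(g0) / len(g0))
    return sig

def mean_rank(sig, gene_set):
    """Mean rank of the gene-set members within the full signature."""
    r = ranks(sig)
    return sum(r[i] for i in gene_set) / len(gene_set)

def analytical_p(sig, gene_set):
    """Stand-in analytical null (assumption): n ranks drawn without replacement
    from 1..N, with a normal approximation for their mean."""
    N, n = len(sig), len(gene_set)
    mu = (N + 1) / 2
    var = (N + 1) * (N - n) / (12 * n)
    z = (mean_rank(sig, gene_set) - mu) / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

def shuffling_p(expr, labels, gene_set, n_perm=200, seed=0):
    """Empirical null: shuffle sample labels and recompute the statistic."""
    rng = random.Random(seed)
    mu = (len(expr) + 1) / 2
    observed = abs(mean_rank(signature(expr, labels), gene_set) - mu)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if abs(mean_rank(signature(expr, perm), gene_set) - mu) >= observed:
            hits += 1
    # (hits + 1) / (n_perm + 1) keeps the empirical p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 200 genes, 6 "tumor" and 6 "normal" samples;
# the first 20 genes form the gene set and are shifted upward in tumors.
rng = random.Random(1)
labels = [1] * 6 + [0] * 6
gene_set = list(range(20))
expr = [[rng.gauss(2.0 if (g < 20 and lab == 1) else 0.0, 1.0) for lab in labels]
        for g in range(200)]

z_demo, p_analytical = analytical_p(signature(expr, labels), gene_set)
p_empirical = shuffling_p(expr, labels, gene_set, n_perm=200)
print(f"z = {z_demo:.2f}, analytical p = {p_analytical:.3g}, empirical p = {p_empirical:.3g}")
```

One design point the sketch makes concrete: the empirical p-value can never fall below 1 / (n_perm + 1), so its resolution is bounded by the number of shuffles, whereas an analytical null yields a p-value directly from the statistic.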
