A Decision‐Theory Approach to Interpretable Set Analysis for High‐Dimensional Data

A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets that focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach finds the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is obtained by thresholding the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses two key issues: overlapping annotations and the difficulty of interpreting p-values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.
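The thresholding idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes that atoms are the groups of variables sharing the same pattern of set memberships, that a per-variable posterior probability of being null (`p_null`, hypothetical here) is already available, and that the loss is `w * (false discoveries) + (1 - w) * (missed discoveries)`. Under these assumptions, reporting an atom lowers the expected loss exactly when its average null probability falls below `1 - w`; that average is used here as a stand-in for the afdr, whose exact definition is not given in this abstract. The names `build_atoms`, `atom_null_fraction`, and `select_atoms` are illustrative only.

```python
from collections import defaultdict


def build_atoms(sets):
    """Partition annotated variables into non-overlapping atoms.

    Assumption: an atom is the group of variables that belong to exactly
    the same collection of pre-defined sets.
    """
    membership = defaultdict(set)
    for name, members in sets.items():
        for variable in members:
            membership[variable].add(name)
    atoms = defaultdict(list)
    for variable, names in membership.items():
        atoms[frozenset(names)].append(variable)
    return {key: sorted(variables) for key, variables in atoms.items()}


def atom_null_fraction(atom, p_null):
    """Average posterior null probability in an atom (stand-in for the afdr)."""
    return sum(p_null[v] for v in atom) / len(atom)


def select_atoms(atoms, p_null, w):
    """Report atoms whose expected false-discovery cost is below the
    expected missed-discovery cost, i.e. mean null probability < 1 - w."""
    threshold = 1.0 - w
    return {
        key
        for key, atom in atoms.items()
        if atom_null_fraction(atom, p_null) < threshold
    }


# Toy example with two overlapping sets (all values are made up).
sets = {"A": {"g1", "g2", "g3"}, "B": {"g3", "g4"}}
p_null = {"g1": 0.1, "g2": 0.2, "g3": 0.9, "g4": 0.8}

atoms = build_atoms(sets)
selected = select_atoms(atoms, p_null, w=0.5)
```

In this toy example the overlap `g3` becomes its own atom, so the signal carried by `g1` and `g2` is attributed to the part of set A that does not overlap with set B.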
