Nonparametric false discovery rate control for identifying simultaneous signals

It is frequently of interest to jointly analyze several sequences of multiple tests in order to identify simultaneous signals, defined as features tested in two or more independent studies that are significant in each. For example, researchers often wish to discover genetic variants that are significantly associated with multiple traits. This paper proposes a false discovery rate control procedure for identifying simultaneous signals in two studies. A pair of test statistics is available for each feature, and the goal is to identify features for which both are non-null. Error control is difficult because the null hypothesis is composite: a feature is not a simultaneous signal whenever at least one test in its pair is null, so the other test in the pair can still be non-null. Very few existing methods have high power while still provably controlling the false discovery rate. This paper proposes a simple, fast, tuning-parameter-free nonparametric procedure that can be shown to provide asymptotically conservative false discovery rate control. Surprisingly, the procedure does not require knowledge of either the null or the alternative distributions of the test statistics. In simulations, the proposed method had higher power and better error control than existing procedures. In an analysis of genome-wide association study results from five psychiatric disorders, it identified more pairs of disorders that share simultaneously significant genetic variants, as well as more variants themselves, compared to other methods. The proposed method is available in the R package ssa.
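To make the two-study setup concrete, the following is a minimal Python sketch of a classical baseline, not the procedure proposed in this paper: it takes the larger of the two p-values for each feature and applies the Benjamini-Hochberg step-up rule to those maxima. Unlike the proposed method, this baseline assumes the null distributions are known (standard normal here) so that p-values can be computed. The function names `two_sided_p` and `max_p_bh`, the simulated effect size, and the signal layout are all illustrative assumptions.

```python
import numpy as np
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided p-values for z-scores, assuming a standard normal null."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return np.array([erfc(abs(v) / sqrt(2.0)) for v in z])


def max_p_bh(p1, p2, alpha=0.05):
    """Baseline only (NOT the paper's procedure): BH applied to max(p1, p2).

    The maximum of the two p-values is a valid p-value for the composite
    null "at least one of the two tests is null".
    """
    pmax = np.maximum(np.asarray(p1, dtype=float), np.asarray(p2, dtype=float))
    m = pmax.size
    order = np.argsort(pmax)
    # BH step-up: find the largest k with p_(k) <= alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pmax[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        rejected[order[: k + 1]] = True
    return rejected


# Simulated example (hypothetical settings): 2000 features, two studies.
rng = np.random.default_rng(0)
m = 2000
signal1 = np.zeros(m, dtype=bool)
signal2 = np.zeros(m, dtype=bool)
signal1[:200] = True      # non-null in study 1
signal2[100:300] = True   # non-null in study 2; features 100..199 are simultaneous
z1 = rng.normal(size=m) + 4.0 * signal1
z2 = rng.normal(size=m) + 4.0 * signal2

rejected = max_p_bh(two_sided_p(z1), two_sided_p(z2), alpha=0.05)
simultaneous = signal1 & signal2
false_discoveries = int(np.sum(rejected & ~simultaneous))
print("discoveries:", int(rejected.sum()), "false discoveries:", false_discoveries)
```

In this simulation, features 0 to 99 and 200 to 299 are non-null in only one study, so rejecting any of them is a false discovery even though one of their tests carries real signal. This is the composite-null difficulty the abstract describes.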
