GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference

Abstract: This article develops a general framework for exploiting sparsity information in two-sample multiple testing problems. We propose first constructing a covariate sequence, in addition to the usual primary test statistics, to capture the sparsity structure, and then incorporating the auxiliary covariates in inference via a three-step algorithm consisting of grouping, adjusting and pooling (GAP). The GAP procedure provides a simple and effective framework for information pooling. An important advantage of GAP is its ability to handle various dependence structures, such as those arising in high-dimensional linear regression, differential correlation analysis, and differential network analysis. We establish general conditions under which GAP is asymptotically valid for false discovery rate control, and show that these conditions are fulfilled in a range of settings, including testing multivariate normal means, high-dimensional linear regression, differential covariance or correlation matrices, and Gaussian graphical models. Numerical results demonstrate that existing methods can be significantly improved by the proposed framework. The GAP procedure is illustrated using a breast cancer study for identifying gene–gene interactions.
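The abstract names the three steps but gives no formulas, so the following is only a rough sketch of how a grouping, adjusting and pooling pipeline might look, not the authors' procedure. Everything specific here is an assumption made for illustration: the function name `gap_sketch`, grouping by covariate quantiles, a Storey-type estimate of the null proportion with tuning parameter `lam`, the weight `pi0 / pi1` applied to each p-value, and the Benjamini-Hochberg step on the pooled values. The sketch also assumes the p-values and the covariate sequence are already computed.

```python
import numpy as np

def gap_sketch(pvals, covariate, n_groups=3, alpha=0.05, lam=0.5):
    """Illustrative grouping / adjusting / pooling pipeline (hypothetical)."""
    pvals = np.asarray(pvals, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    m = pvals.size

    # Step 1 (grouping): split hypotheses by quantiles of the covariate.
    # Assumption: equal-probability cut points are used.
    probs = np.linspace(0.0, 1.0, n_groups + 1)[1:-1]
    edges = np.quantile(covariate, probs)
    groups = np.searchsorted(edges, covariate, side="right")

    # Step 2 (adjusting): reweight p-values by estimated group sparsity.
    # Assumption: Storey-type estimate of the null proportion per group.
    adjusted = np.ones(m)
    for g in range(n_groups):
        idx = groups == g
        if not idx.any():
            continue
        pi0 = min(1.0, float((pvals[idx] > lam).mean()) / (1.0 - lam))
        pi1 = 1.0 - pi0
        if pi1 > 0.0:
            adjusted[idx] = np.minimum(1.0, pvals[idx] * pi0 / pi1)
        # If pi1 == 0 the group looks entirely null; values stay at 1.

    # Step 3 (pooling): combine all adjusted p-values and apply the
    # Benjamini-Hochberg step-up rule at level alpha (an assumption).
    order = np.argsort(adjusted)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(adjusted[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True
    return reject
```

In this sketch, groups whose covariate values suggest many signals receive smaller adjusted p-values, so pooling ranks their hypotheses ahead of those in groups that look mostly null. Whether such a simplified rule controls the false discovery rate is not established here; the article's conditions apply to the GAP procedure as defined there.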
