CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING.

We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors, such as batch effects and unmeasured covariates, that correlate with both the primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for such confounders in hypothesis testing. We unify these methods in a single framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664-1688], which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two estimators, based on RUV-4 and LEAPP respectively, are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator, which observes the latent confounding factors. For hypothesis testing, we show that the asymptotic z-tests based on these estimators control the type I error. Numerical experiments show that the Benjamini-Hochberg procedure also controls the false discovery rate when the sample size is reasonably large.
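As an illustration of the negative-control scenario, the following is a minimal sketch and not the estimator analyzed in the paper. It assumes a linear factor model Y = x beta^T + Z Gamma^T + E with a single primary variable x and latent factors Z = x alpha^T + W. It uses principal components in place of a likelihood-based factor analysis, and it assumes a large-sample variance inflation factor of 1 + ||alpha||^2 for the z-statistics. The function names, simulation settings, and parameter values are all hypothetical.

```python
import math
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure at level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

def negative_control_adjust(Y, x, controls, r):
    """Hypothetical helper: adjusted effects, naive effects, two-sided p-values."""
    n, p = Y.shape
    xx = x @ x
    beta_naive = Y.T @ x / xx            # marginal regression; biased by Gamma @ alpha
    resid = Y - np.outer(x, beta_naive)  # part of Y orthogonal to x
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    # Loadings (p x r), scaled so the estimated factors have roughly unit variance.
    gamma = Vt[:r].T * (s[:r] / math.sqrt(n))
    # On negative controls beta = 0, so beta_naive is approximately gamma @ alpha.
    alpha, *_ = np.linalg.lstsq(gamma[controls], beta_naive[controls], rcond=None)
    beta = beta_naive - gamma @ alpha
    noise = resid - (U[:, :r] * s[:r]) @ Vt[:r]
    sigma2 = (noise ** 2).sum(axis=0) / (n - 1 - r)
    # Assumed large-sample standard error with inflation factor 1 + ||alpha||^2.
    se = np.sqrt(sigma2 * (1.0 + alpha @ alpha) / xx)
    z = beta / se
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, beta_naive, pvals

# Simulated data with strong confounding (all settings are illustrative).
rng = np.random.default_rng(0)
n, p, r = 200, 1000, 2
x = rng.standard_normal(n)
alpha_true = np.array([1.0, -0.5])
Z = np.outer(x, alpha_true) + rng.standard_normal((n, r))
Gamma = rng.standard_normal((p, r))
beta_true = np.zeros(p)
beta_true[:50] = 1.0                   # sparse true effects
controls = np.arange(p - 100, p)       # variables known a priori to be null
Y = np.outer(x, beta_true) + Z @ Gamma.T + rng.standard_normal((n, p))

beta_hat, beta_naive, pvals = negative_control_adjust(Y, x, controls, r)
rejected = benjamini_hochberg(pvals, q=0.1)
print("mean abs error, naive:   ", np.abs(beta_naive - beta_true).mean())
print("mean abs error, adjusted:", np.abs(beta_hat - beta_true).mean())
print("rejections:", int(rejected.sum()), "of which true signals:", int(rejected[:50].sum()))
```

In this sketch the number of factors r is taken as known; in practice it would have to be estimated, and the sparsity-based scenario would replace the least-squares step on the negative controls with a robust regression over all variables.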
