Estimation of the false discovery proportion with unknown dependence

Large-scale multiple testing with correlated test statistics arises frequently in scientific research. Incorporating correlation information when approximating the false discovery proportion (FDP) has attracted increasing attention in recent years. When the covariance matrix of the test statistics is known, Fan, Han & Gu (2012) provided an accurate approximation of the FDP under an arbitrary dependence structure and some sparsity assumptions. However, the covariance matrix is unknown in many applications, and the dependence information must be estimated before the FDP can be approximated. The accuracy of this estimation can greatly affect the FDP approximation. In this paper, we study theoretically the impact of unknown dependence on the testing procedure and establish a general framework under which the FDP can be well approximated. Unknown dependence affects the FDP approximation in two major ways: through the estimation of eigenvalues and eigenvectors, and through the estimation of marginal variances. To address these two challenges, we first develop general requirements on the estimates of eigenvalues and eigenvectors for a good approximation of the FDP. We then give conditions on the structure of the covariance matrix under which these requirements are satisfied. Such dependence structures include banded or sparse covariance matrices and (conditional) sparse precision matrices. Within this framework, we also consider a special example to illustrate our method, in which the data are sampled from an approximate factor model, a setting that encompasses most practical situations. We provide a good approximation of the FDP by exploiting this specific dependence structure. The results are further generalized to the case where the multivariate normality assumption is relaxed. Our results are demonstrated by simulation studies and real data applications.
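To make the role of eigenvalues and eigenvectors concrete, the following is a minimal Python sketch of an eigen-decomposition-based FDP approximation in the known-covariance setting. It is an illustration under stated assumptions, not the procedure of the paper: the function name `pfa_fdp`, the one-factor simulation, the choice of k = 1, the least-squares factor fit on the 90% of coordinates with smallest |z|, and summing over all coordinates (relying on sparsity of the signals) are all choices made here for illustration. With unknown dependence, `sigma` would be replaced by an estimate, which is where the accuracy of the estimated eigenvalues, eigenvectors, and marginal variances enters.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_phi = np.vectorize(_STD_NORMAL.cdf)  # elementwise standard normal CDF


def pfa_fdp(z, sigma, t, k):
    """Approximate FDP(t) from z-values and a correlation matrix (illustrative)."""
    # Top-k eigenpairs of sigma; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    top = np.argsort(eigvals)[::-1][:k]
    B = eigvecs[:, top] * np.sqrt(eigvals[top])  # row i is the loading b_i
    # Assumption: fit the realized factors by least squares on the 90% of
    # coordinates with the smallest |z|, which are taken to be mostly nulls.
    keep = np.argsort(np.abs(z))[: int(0.9 * len(z))]
    w_hat, *_ = np.linalg.lstsq(B[keep], z[keep], rcond=None)
    eta = B @ w_hat  # estimated common-factor contribution to each z_i
    # a_i = 1 / sd of the idiosyncratic part; clip guards against round-off.
    a = 1.0 / np.sqrt(np.clip(1.0 - np.sum(B**2, axis=1), 1e-8, None))
    z_half = _STD_NORMAL.inv_cdf(t / 2.0)  # negative quantile for level t
    # Conditional probability that a true null is rejected, summed over all i.
    expected_false = np.sum(_phi(a * (z_half + eta)) + _phi(a * (z_half - eta)))
    rejections = np.sum(np.abs(z) > -z_half)
    return float(min(1.0, expected_false / max(rejections, 1)))


# Hypothetical one-factor example: Sigma = b b^T + diag(1 - b^2), unit diagonal.
rng = np.random.default_rng(0)
p, n_signal, t = 500, 25, 0.01
loadings = rng.uniform(-0.8, 0.8, size=p)
sigma = np.outer(loadings, loadings)
np.fill_diagonal(sigma, 1.0)

mu = np.zeros(p)
mu[:n_signal] = 4.0  # the first n_signal hypotheses are true signals
z = mu + loadings * rng.standard_normal() \
    + np.sqrt(1.0 - loadings**2) * rng.standard_normal(p)

reject = np.abs(z) > -NormalDist().inv_cdf(t / 2.0)
true_fdp = reject[n_signal:].sum() / max(reject.sum(), 1)  # realized FDP
approx = pfa_fdp(z, sigma, t, k=1)
print(f"realized FDP = {true_fdp:.3f}, approximated FDP = {approx:.3f}")
```

Rerunning the simulation with different seeds shows how the realized FDP moves with the common factor, which is the quantity the approximation tries to track.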

[1]  T. Cai,et al.  A Constrained ℓ1 Minimization Approach to Sparse Precision Matrix Estimation , 2011, 1102.2233.

[2]  John D. Storey,et al.  Cross-Dimensional Inference of Dependent High-Dimensional Data , 2012 .

[3]  D. Donoho,et al.  Higher criticism for detecting sparse heterogeneous mixtures , 2004, math/0410072.

[4]  M. Rothschild,et al.  Arbitrage, Factor Structure, and Mean-Variance Analysis on Large Asset Markets , 1982 .

[5]  Weidong Liu,et al.  Adaptive Thresholding for Sparse Covariance Matrix Estimation , 2011, 1102.2237.

[6]  Chloé Friguet,et al.  A Factor Model Approach to Multiple Testing Under Dependence , 2009 .

[7]  P. Hall,et al.  Robustness of multiple testing procedures against dependence , 2009, 0903.0464.

[8]  Wenguang Sun,et al.  Large‐scale multiple testing under dependence , 2009 .

[9]  J. Bai,et al.  Determining the Number of Factors in Approximate Factor Models , 2000 .

[10]  Adam J. Rothman,et al.  Sparse permutation invariant covariance estimation , 2008, 0801.4837.

[11]  Jianqing Fan,et al.  Estimating False Discovery Proportion under Arbitrary Covariance Dependence , 2012, Journal of the American Statistical Association.

[12]  Jianqing Fan,et al.  High Dimensional Covariance Matrix Estimation in Approximate Factor Models , 2011, Annals of statistics.

[13]  Xihong Lin,et al.  The effect of correlation in false discovery rate estimation. , 2011, Biometrika.

[14]  Jianqing Fan,et al.  Regularization of Wavelet Approximations , 2001 .

[15]  R. Engle,et al.  A One-Factor Multivariate Time Series Model of Metropolitan Wage Rates , 1981 .

[16]  B. Efron Correlation and Large-Scale Simultaneous Significance Testing , 2007 .

[17]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[18]  Y. Fujikoshi,et al.  Approximations for the quantiles of Student's $t$ and $F$ distributions and their error bounds , 1993 .

[19]  D. Donoho,et al.  Asymptotic Minimaxity Of False Discovery Rate Thresholding For Sparse Exponential Data , 2006, math/0602311.

[20]  P. Bickel,et al.  Regularized estimation of large covariance matrices , 2008, 0803.1909.

[21]  A. Schwartzman,et al.  The Empirical Distribution of a Large Number of Correlated Normal Variables , 2015, Journal of the American Statistical Association.

[22]  Jianqing Fan,et al.  Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties , 2001 .

[23]  Y. Benjamini,et al.  THE CONTROL OF THE FALSE DISCOVERY RATE IN MULTIPLE TESTING UNDER DEPENDENCY , 2001 .

[24]  J. Sudbø,et al.  Gene-expression profiles in hereditary breast cancer. , 2001, The New England journal of medicine.

[25]  M. M. Siddiqui A Bivariate $t$ Distribution , 1967 .

[26]  Zongming Ma Sparse Principal Component Analysis and Iterative Thresholding , 2011, 1112.2432.

[27]  W. Kahan,et al.  The Rotation of Eigenvectors by a Perturbation. III , 1970 .

[28]  M. West,et al.  High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics , 2008, Journal of the American Statistical Association.

[29]  Jianqing Fan,et al.  Large covariance estimation by thresholding principal orthogonal complements , 2011, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[30]  Seung C. Ahn,et al.  Eigenvalue Ratio Test for the Number of Factors , 2013 .

[31]  P. Bickel,et al.  Covariance regularization by thresholding , 2009, 0901.3079.

[32]  Noureddine El Karoui,et al.  Operator norm consistent estimation of large-dimensional sparse covariance matrices , 2008, 0901.3220.

[33]  Gary Chamberlain,et al.  FUNDS, FACTORS, AND DIVERSIFICATION IN ARBITRAGE PRICING MODELS , 1983 .

[34]  John D. Storey,et al.  Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach , 2004 .

[35]  John D. Storey A direct approach to false discovery rates , 2002 .

[36]  Charles R. Johnson,et al.  Matrix analysis , 1985 .

[37]  Adam J. Rothman,et al.  Generalized Thresholding of Large Covariance Matrices , 2009 .

[38]  J. Bai,et al.  Inferential Theory for Factor Models of Large Dimensions , 2003 .

[39]  N. Higham COMPUTING A NEAREST SYMMETRIC POSITIVE SEMIDEFINITE MATRIX , 1988 .

[40]  Jeffrey T Leek,et al.  A general framework for multiple testing dependence , 2008, Proceedings of the National Academy of Sciences.

[41]  Olivier Ledoit,et al.  Improved estimation of the covariance matrix of stock returns with an application to portfolio selection , 2003 .

[42]  K. Strimmer,et al.  A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics , 2011, Statistical Applications in Genetics and Molecular Biology.

[43]  Piotr Fryzlewicz,et al.  NOVELIST estimator of large correlation and covariance matrices and their inverses , 2018, TEST.

[44]  S. Sarkar Some Results on False Discovery Rate in Stepwise multiple testing procedures , 2002 .

[45]  Runlong Tang,et al.  Partial Consistency with Sparse Incidental Parameters. , 2012, Statistica Sinica.

[46]  Clifford Lam,et al.  Factor modeling for high-dimensional time series: inference for the number of factors , 2012, 1206.0613.

[47]  B. Efron Correlated z-Values and the Accuracy of Large-Scale Statistical Estimates , 2010, Journal of the American Statistical Association.