Heterogeneity adjustment with applications to graphical model inference.

Heterogeneity is unwanted variation that arises when analyzing datasets aggregated from multiple sources. Though various methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify them. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasiveness assumption that the latent heterogeneity factors simultaneously affect a large fraction of the observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the "blessing of dimensionality". As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
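As a rough illustration of the idea described above (remove a low-rank heterogeneity component from each source, then pool the residuals), a minimal NumPy sketch follows. It is not the authors' implementation: the function names `remove_top_factors` and `adjust_and_pool` are hypothetical, the number of factors per source is assumed to be known rather than estimated, no covariates are used, and the final precision estimate is a simple ridge-regularized inverse standing in for a sparse precision matrix estimator.

```python
import numpy as np


def remove_top_factors(X, k):
    """Subtract the best rank-k approximation from one centered source.

    X has shape (n_samples, p). The rank-k part is treated as the
    source-specific heterogeneity; k is assumed known here.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
    return Xc - low_rank


def adjust_and_pool(sources, ranks):
    """Adjust each source separately, then stack the residuals."""
    residuals = [remove_top_factors(X, k) for X, k in zip(sources, ranks)]
    return np.vstack(residuals)


# Synthetic example: two sources with different latent factors.
rng = np.random.default_rng(0)
p = 20
sources = []
for n, k in [(50, 2), (80, 1)]:
    factors = rng.normal(size=(n, k))
    loadings = 3.0 * rng.normal(size=(k, p))
    noise = rng.normal(size=(n, p))
    sources.append(factors @ loadings + noise)

pooled = adjust_and_pool(sources, ranks=[2, 1])

# Placeholder precision estimate: ridge-regularized inverse of the
# pooled sample covariance (an assumption, not a sparse estimator).
cov = pooled.T @ pooled / pooled.shape[0]
precision = np.linalg.inv(cov + 0.1 * np.eye(p))
```

In practice the number of factors would have to be estimated for each source, and the pooled residuals would be passed to a sparse precision matrix estimator instead of the ridge inverse used here.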
