Large covariance estimation by thresholding principal orthogonal complements

This paper considers the estimation of a high dimensional covariance matrix with a conditional sparsity structure and fast-diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for some cross‐sectional correlation to remain even after the common but unobservable factors are taken out. We introduce the principal orthogonal complement thresholding method ‘POET’ to exploit such an approximate factor structure with sparsity. The POET‐estimator includes the sample covariance matrix, the factor‐based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as special cases. We provide mathematical insights into when factor analysis is approximately the same as principal component analysis for high dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.
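
To make the construction concrete, the following is a minimal sketch of the general idea described above, not the authors' implementation. It keeps the leading K principal components of the sample covariance matrix and thresholds the off-diagonal entries of the remainder (the principal orthogonal complement). The function name `poet_sketch`, the assumption that K is known, and the single hard threshold `tau` applied to every off-diagonal entry are all illustrative assumptions; the abstract also mentions adaptive thresholding, which this sketch does not implement.

```python
import numpy as np


def poet_sketch(X, K, tau):
    """Illustrative POET-style covariance estimate (hypothetical helper).

    X   : (n, p) array, one observation per row.
    K   : number of principal components kept (assumed known here).
    tau : universal hard threshold for off-diagonal residual entries
          (a simplification chosen for this sketch).
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n  # sample covariance (divides by n)

    # Eigen-decomposition of the symmetric matrix S; eigh returns
    # eigenvalues in ascending order, so reorder to descending.
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Low-rank part spanned by the first K principal components.
    V = evecs[:, :K]
    low_rank = (V * evals[:K]) @ V.T

    # Principal orthogonal complement: what is left after removing it.
    R = S - low_rank

    # Hard-threshold the off-diagonal entries; keep the diagonal as is.
    R_thr = np.where(np.abs(R) >= tau, R, 0.0)
    np.fill_diagonal(R_thr, np.diag(R))

    return low_rank + R_thr
```

Under these assumptions the special cases listed in the abstract can be read off directly: `tau = 0` returns the sample covariance matrix, `K = 0` returns a thresholded sample covariance matrix, and `tau = np.inf` leaves a diagonal residual, which corresponds to a factor‐based covariance matrix with uncorrelated errors.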
