High Dimensional Influence Measure

Influence diagnosis is important because the presence of influential observations can lead to distorted analyses and misleading interpretations. This is especially true for high dimensional data, where the increased dimensionality and complexity may amplify both the chance that an observation is influential and its potential impact on the analysis. In this article, we propose a novel high dimensional influence measure for regressions in which the number of predictors far exceeds the sample size. Our proposal can be viewed as a high dimensional counterpart of the classical Cook's distance. However, whereas Cook's distance quantifies an individual observation's influence on the least squares estimate of the regression coefficients, our new diagnostic measure captures the influence on the marginal correlations, which in turn seriously affects downstream analyses including coefficient estimation, variable selection, and screening. Moreover, we establish the asymptotic distribution of the proposed influence measure by letting the predictor dimension go to infinity. The availability of this asymptotic distribution leads to a principled rule for determining the critical value used to detect influential observations. Both simulations and real data analysis demonstrate the usefulness of the new influence diagnostic measure.
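The abstract does not give the exact statistic, but the core idea, a leave-one-out comparison of the marginal correlations between each predictor and the response, can be sketched directly. Below is a minimal Python illustration, assuming the influence score for observation i averages the squared changes in the p sample marginal correlations when that observation is deleted; the function names (`marginal_correlations`, `influence_scores`), the toy data, and the use of the largest score to flag a suspect row are illustrative choices, not the paper's exact measure or its critical-value rule.

```python
import numpy as np


def marginal_correlations(X, y):
    """Sample Pearson correlation between each column of X and the response y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den


def influence_scores(X, y):
    """Illustrative leave-one-out influence on marginal correlations:
    score_i = (1/p) * sum_j (rho_j - rho_j^(-i))^2."""
    n, _ = X.shape
    rho_full = marginal_correlations(X, y)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                  # delete observation i
        rho_loo = marginal_correlations(X[keep], y[keep])
        scores[i] = np.mean((rho_full - rho_loo) ** 2)
    return scores


# Toy example (hypothetical data): n = 50 observations, p = 500 predictors,
# with the first observation contaminated after the response is generated.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0
y = X @ beta + rng.standard_normal(n)
X[0] += 10.0                                      # corrupt the first row

scores = influence_scores(X, y)
print("Most influential observation (by this sketch):", scores.argmax())
```

In the proposed method, a critical value derived from the asymptotic distribution of the measure (as the predictor dimension goes to infinity) would replace the ad hoc "largest score" rule used in this toy example.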
