Detecting influential observations in principal components and common principal components

Detecting outlying observations is an important step in any analysis, even when robust estimates are used. In particular, the robustified Mahalanobis distance is a natural measure of outlyingness when one focuses on ellipsoidal distributions. However, it is well known that the asymptotic chi-square approximation for the cutoff value of the Mahalanobis distance based on several robust estimates (such as the minimum volume ellipsoid, the minimum covariance determinant, and S-estimators) is not adequate for detecting atypical observations in small samples from the normal distribution. In the multi-population setting, under a common principal components model, aggregated measures based on standardized empirical influence functions are used to detect observations with a significant impact on the estimators. As in the one-population setting, the cutoff values obtained from the asymptotic distribution of these aggregated measures are not adequate for small samples. More appropriate cutoff values, adapted to the sample sizes, can be computed by using a cross-validation approach. Cutoff values obtained from a Monte Carlo study using S-estimators are provided for illustration. A real data set is also analyzed.
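The basic detection rule the abstract discusses can be sketched as follows: compute squared Mahalanobis distances and compare them against the chi-square quantile with p degrees of freedom. This minimal sketch uses the classical mean and covariance for self-containment; the paper's point is that when these are replaced by robust estimates (MCD, MVE, S-estimators), the chi-square cutoff below becomes inadequate in small samples and should be replaced by sample-size-adapted values.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """Flag observations whose squared Mahalanobis distance exceeds the
    chi-square(p) quantile. Classical location/scatter estimates are used
    here for illustration; a robust estimator would be substituted in
    practice, at which point the asymptotic chi-square cutoff is known
    to be inadequate for small n."""
    mu = X.mean(axis=0)                      # location estimate
    cov = np.cov(X, rowvar=False)            # scatter estimate
    inv = np.linalg.inv(cov)
    diff = X - mu
    # squared Mahalanobis distance for each row
    d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)
    cutoff = chi2.ppf(alpha, df=X.shape[1])  # asymptotic cutoff
    return d2, cutoff, d2 > cutoff
```

A gross outlier is still flagged with classical estimates; the paper's concern is subtler masking effects and calibration of the cutoff for robust distances in small samples.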
