Principal component analysis for compositional data with outliers

Compositional data (almost all data in geochemistry) are closed data: they usually sum to a constant (e.g. weight percent, wt.%) and carry only relative information. As a result, the covariance structure of compositional data is strongly biased, and the results of many multivariate techniques become doubtful without a proper transformation of the data. The centred logratio (clr) transformation is often used to open closed data. However, clr-transformed data do not have full rank and therefore cannot be used for robust multivariate techniques like principal component analysis (PCA). Here we propose to use the isometric logratio (ilr) transformation instead. The ilr transformation has the disadvantage that the resulting new variables are no longer directly interpretable in terms of the original variables. We therefore propose a technique by which the scores and loadings of a robust PCA on ilr-transformed data can be back-transformed and interpreted. The procedure is demonstrated using a real data set from regional geochemistry and compared to results from non-transformed and non-robust versions of PCA. The procedure using ilr-transformed data and robust PCA delivers superior results to all other approaches. The examples demonstrate that, owing to the compositional nature of geochemical data, PCA should not be carried out without an appropriate transformation. Furthermore, a robust approach is preferable if the data set contains outliers. Copyright © 2009 John Wiley & Sons, Ltd.
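
As a rough illustration of the transformations named in the abstract, the following sketch computes clr and ilr coordinates for a single composition and maps ilr coordinates back to clr space. It is not the paper's implementation: the function names, the particular orthonormal basis, and the example composition are assumptions made for illustration, and the robust PCA step itself is left out. Reading "back-transformed" as a mapping from ilr to clr coordinates is likewise an assumption.

```python
import math

def clr(x):
    """Centred logratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ilr_basis(d):
    """One possible orthonormal basis of the clr hyperplane (an assumed choice).

    Returns d - 1 vectors of length d; each sums to zero and has unit norm.
    """
    basis = []
    for j in range(1, d):
        scale = math.sqrt(j / (j + 1))
        # j entries of scale/j, then -scale, then zeros
        basis.append([scale / j] * j + [-scale] + [0.0] * (d - j - 1))
    return basis

def ilr(x):
    """Isometric logratio: coordinates of clr(x) in the orthonormal basis."""
    c = clr(x)
    return [sum(ci * vi for ci, vi in zip(c, col)) for col in ilr_basis(len(x))]

def ilr_to_clr(z):
    """Map ilr coordinates (or an ilr-space loading vector) back to clr space."""
    d = len(z) + 1
    basis = ilr_basis(d)
    return [sum(zj * col[i] for zj, col in zip(z, basis)) for i in range(d)]

# Hypothetical three-part composition summing to 1
composition = [0.6, 0.3, 0.1]
clr_coords = clr(composition)        # D values that sum to zero (rank deficient)
ilr_coords = ilr(composition)        # D - 1 values, full rank
recovered = ilr_to_clr(ilr_coords)   # equals clr_coords up to rounding
```

The clr coordinates always sum to zero, which is why their covariance matrix is singular. The ilr coordinates have one dimension fewer and carry no such constraint. Because the basis is orthonormal, the same linear map used in `ilr_to_clr` could be applied to each loading vector obtained in ilr space, so that every back-transformed entry refers to one original part.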
