Cluster analysis applied to regional geochemical data: Problems and possibilities

A large regional geochemical data set of O-horizon samples from a 188,000 km² area in the European Arctic, analysed for 38 chemical elements, pH, electrical conductivity (both in a water extraction) and loss on ignition (LOI, 480 °C), was used to test how different variants of cluster analysis influence the results obtained. Because regional geochemical data are neither normally nor log-normally distributed, and are typically strongly skewed and often multi-modal, cluster analysis results usually depend strongly on the clustering algorithm selected. Deleting or adding just one element (variable) in the input matrix can also drastically change the results. Different variants of cluster analysis can lead to surprisingly different results even when using exactly the same input data. Given that the selection of elements is often based on the availability of analytical packages (or on detection limits) rather than on geochemical reasoning, this is a disturbing result. Cluster analysis can be used to group samples and to develop ideas about the multivariate geochemistry of the data set at hand. It should not be misused as a statistical "proof" of certain relationships in the data. Using cluster analysis as an exploratory data analysis tool requires a powerful program system that can present the results in a number of easy-to-grasp graphics. In the context of this work, such a tool has been developed as a package for the R statistical software.
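The sensitivity described above can be explored on a toy scale. The following is a minimal sketch, not the R package mentioned in the abstract. It assumes synthetic log-normal "concentrations" in place of the real O-horizon data, and uses a plain k-means as a stand-in for the clustering variants studied. The names `make_samples`, `kmeans` and `rand_index` are hypothetical helpers introduced only for illustration. The sketch clusters the same samples three ways (raw, log-transformed, and log-transformed with one variable dropped) and reports how well the resulting partitions agree.

```python
import math
import random


def make_samples(n=60, n_vars=4, seed=1):
    """Synthetic, strongly skewed data: two groups of log-normal values.

    Assumption for illustration only; this is not the data set from the paper.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        shift = 0.0 if i < n // 2 else 1.5  # second half has higher levels
        rows.append([rng.lognormvariate(shift, 1.0) for _ in range(n_vars)])
    return rows


def kmeans(rows, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(r, centers[c])))
            for r in rows
        ]
        if new_labels == labels:  # partition no longer changes
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(a, b):
    """Fraction of sample pairs on which two partitions agree (1.0 = identical)."""
    agree = pairs = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            pairs += 1
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / pairs


rows = make_samples()
logged = [[math.log(v) for v in r] for r in rows]  # log-transform each value
dropped = [r[:-1] for r in logged]                 # same data, last variable removed

raw_labels = kmeans(rows, 2)
log_labels = kmeans(logged, 2)
drop_labels = kmeans(dropped, 2)

print("raw vs log-transformed:", rand_index(raw_labels, log_labels))
print("all variables vs one dropped:", rand_index(log_labels, drop_labels))
```

A Rand index well below 1.0 for either comparison would indicate that the grouping of samples changed merely because of the transformation or the variable selection. Which values appear depends on the synthetic data and the seeds, so the numbers illustrate the kind of check involved and say nothing about the paper's data set.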
