Factor analysis applied to regional geochemical data: problems and possibilities

Abstract: Cluster analysis can be used to group samples and to develop ideas about the multivariate geochemistry of the data set at hand. Because regional geochemical data are complex (neither normal nor log-normal, strongly skewed, often multi-modal, and subject to data closure), cluster analysis results often depend strongly on the preparation of the data (e.g. the choice of transformation) and on the clustering algorithm selected. Different variants of cluster analysis can lead to surprisingly different cluster centroids, cluster sizes and classifications, even when exactly the same input data are used. Cluster analysis should not be misused as a statistical "proof" of certain relationships in the data. Using cluster analysis as an exploratory data analysis tool requires a powerful program system for testing different data preparation, processing and clustering methods, including the ability to present the results in a number of easy-to-grasp graphics. Such a tool has been developed as a package for the R statistical software. Two example data sets from geochemistry are used to demonstrate how the results change with different data preparation and clustering methods. A data set from S-Norway with a known number of clusters and known cluster membership is used to test the performance of different clustering and data preparation techniques. For a complex data set from the Kola Peninsula, cluster analysis is applied to explore regional data structures.
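
The dependence on data preparation described in the abstract can be illustrated with a minimal sketch. It is not the R package mentioned above, and it does not use the S-Norway or Kola data: the concentration values are made up, and `kmeans_1d` is a hypothetical helper that runs plain Lloyd-style k-means on one variable, starting from centroids spaced evenly between the minimum and maximum. The same ten strongly skewed values are clustered once on the raw scale and once after a log10 transformation.

```python
import math


def kmeans_1d(values, k=2, max_iter=100):
    """Lloyd-style k-means on 1-D data (hypothetical helper, requires k >= 2).

    Starting centroids are spaced evenly between min and max so that the
    result is deterministic.
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


# Made-up, strongly right-skewed "concentrations" (illustrative only).
conc = [1, 2, 3, 4, 5, 50, 100, 200, 400, 800]

raw_labels, raw_centroids = kmeans_1d(conc)
log_labels, log_centroids = kmeans_1d([math.log10(c) for c in conc])

print(raw_labels)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(log_labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

On the raw scale the single largest value forms its own cluster, whereas after the log transformation the data split into a low and a high group of five values each. The input values and the algorithm are identical in both runs; only the transformation differs.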
