Evaluation of graphical and multivariate statistical methods for classification of water chemistry data

Abstract. A robust classification scheme for partitioning water chemistry samples into homogeneous groups is an important tool for the characterization of hydrologic systems. In this paper we test the performance of the many available graphical and statistical methodologies used to classify water samples, including the Collins bar diagram, pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, k-means clustering, principal components analysis, and fuzzy k-means clustering. All the methods are discussed and compared in terms of their ability to cluster, ease of use, and ease of interpretation. In addition, several issues related to data preparation, database editing, data-gap filling, data screening, and data quality assurance are discussed, and a database construction methodology is presented.

The graphical techniques proved to have limitations compared with the multivariate methods for large data sets. Principal components analysis is useful for data reduction and for assessing the continuity or overlap of clusters and the similarities in the data. The most efficient grouping was achieved by statistical clustering techniques. However, these techniques do not provide information on the chemistry of the statistical groups. The combination of graphical and statistical techniques provides a consistent and objective means to classify large numbers of samples while retaining the ease of classic graphical presentations.
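
The abstract's closing point, that statistical clusters carry no chemical meaning until they are related back to the measured ions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's procedure: the six samples are invented values in meq/L, the column order (Ca, Mg, Na+K, Cl, SO4, HCO3) is hypothetical, plain Lloyd's k-means stands in for the clustering methods compared in the paper, and the centroids are seeded with the first k samples for reproducibility.

```python
import math

# Hypothetical major-ion data in meq/L; columns are an assumed order:
# Ca, Mg, Na+K, Cl, SO4, HCO3. These values are invented for illustration.
IONS = ["Ca", "Mg", "Na+K", "Cl", "SO4", "HCO3"]
samples = [
    [3.0, 1.0, 0.5, 0.3, 0.4, 3.8],
    [0.5, 0.3, 8.0, 7.5, 0.8, 0.5],
    [2.8, 1.2, 0.6, 0.4, 0.5, 3.7],
    [0.6, 0.4, 7.5, 7.0, 0.9, 0.6],
    [3.2, 0.9, 0.4, 0.2, 0.3, 4.0],
    [0.4, 0.2, 8.5, 8.0, 0.7, 0.4],
]

def zscore(rows):
    """Standardize each column to zero mean and unit population std."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; seeds centroids with the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def cluster_means(rows, labels, k):
    """Mean raw (unstandardized) chemistry per cluster, for interpretation."""
    out = []
    for j in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == j]
        out.append([sum(c) / len(members) for c in zip(*members)])
    return out

labels = kmeans(zscore(samples), k=2)
means = cluster_means(samples, labels, k=2)
for j, m in enumerate(means):
    summary = ", ".join(f"{ion}={v:.2f}" for ion, v in zip(IONS, m))
    print(f"cluster {j}: {summary}")
```

Clustering is done on standardized values so that no single ion dominates the distance, while the per-cluster means are reported in the original units. Those means are the kind of summary that could then be drawn as a Stiff or Piper diagram to give each statistical group a chemical interpretation.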
