Unsupervised classification of multivariate geostatistical data: Two algorithms

With the increasing development of remote sensing platforms and the evolution of sampling facilities in mining and oil industry, spatial datasets are becoming increasingly large, inform a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables in hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinates space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performances of both algorithms are assessed on toy examples and a mining dataset.

[1]  G. Milligan Ultrametric hierarchical clustering algorithms , 1979 .

[2]  Xavier Guyon,et al.  Random fields on a network , 1995 .

[3]  D. Allard,et al.  Clustering geostatistical data , 2000 .

[4]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[5]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[6]  J. Chilès,et al.  Geostatistics: Modeling Spatial Uncertainty , 1999 .

[7]  N. Reid,et al.  AN OVERVIEW OF COMPOSITE LIKELIHOOD METHODS , 2011 .

[8]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[9]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[10]  Jacques Rivoirard,et al.  Domaining by Clustering Multivariate Geostatistical Data , 2012 .

[11]  Gilles Celeux,et al.  EM procedures using mean field-like approximations for Markov model-based image segmentation , 2003, Pattern Recognit..

[12]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  R. Webster,et al.  A geostatistical basis for spatial weighting in multivariate classification , 1989 .

[14]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[15]  Joel Michelin,et al.  Inference of a hidden spatial tessellation from multivariate data: application to the delineation of homogeneous regions in an agricultural field , 2006 .

[16]  W. T. Williams,et al.  A Generalized Sorting Strategy for Computer Classifications , 1966, Nature.

[17]  Gérard Govaert,et al.  Clustering of Spatial Data by the EM Algorithm , 1997 .