A Spectral Clustering Method for Large-Scale Geostatistical Datasets

Spectral clustering is one of the most popular modern clustering techniques for conventional data. Applying the general spectral clustering method to geostatistical data, however, poses two challenges. First, the method produces spatially non-contiguous clusters, which is undesirable in many geoscience applications. Second, its high computational complexity limits its applicability to large-scale problems. This paper presents a spectral clustering method dedicated to large-scale geostatistical datasets in which spatial dependence plays an important role. It extends previous work to large-scale geostatistical datasets by computing the similarity matrix only at a reduced set of locations over the study domain, referred to as anchor locations. All data are used when computing the similarity matrix at the anchor locations, so no data are sacrificed. The spectral clustering algorithm can then be performed efficiently on this anchor-location similarity matrix rather than on a similarity matrix over all data locations. Given the resulting cluster labels of the anchor locations, a weighted k-nearest-neighbour classifier is trained using their geographical coordinates as covariates and their cluster labels as the response. Cluster membership is then assigned to all data locations by applying the trained classifier. The effectiveness of the proposed method in discovering spatially contiguous and meaningful clusters in large-scale geostatistical datasets is illustrated using the US National Geochemical Survey database.
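The final step described above, propagating cluster labels from the anchor locations to all data locations, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_knn_predict`, the inverse-distance weighting, and the toy coordinates are assumptions made for illustration, and the anchor labels are taken as given (in the method they would come from spectral clustering on the anchor-location similarity matrix).

```python
import math
from collections import defaultdict


def weighted_knn_predict(anchor_xy, anchor_labels, query_xy, k=3, eps=1e-12):
    """Label each query location by a distance-weighted vote of its k nearest anchors.

    Assumption: weights are inverse distances; the abstract does not specify
    which weighting kernel the method uses.
    """
    if len(anchor_xy) != len(anchor_labels):
        raise ValueError("anchor_xy and anchor_labels must have the same length")
    k = min(k, len(anchor_xy))
    predictions = []
    for qx, qy in query_xy:
        # Euclidean distance from the query location to every anchor location.
        neighbours = sorted(
            (math.hypot(qx - ax, qy - ay), label)
            for (ax, ay), label in zip(anchor_xy, anchor_labels)
        )
        # Closer anchors contribute larger votes to their cluster label.
        votes = defaultdict(float)
        for dist, label in neighbours[:k]:
            votes[label] += 1.0 / (dist + eps)
        predictions.append(max(votes, key=votes.get))
    return predictions


# Toy example: six anchor locations with labels assumed to come from a
# prior spectral clustering step (hypothetical values).
anchor_xy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
anchor_labels = [0, 0, 0, 1, 1, 1]

# All data locations receive a label from the geographically nearest anchors.
data_xy = [(0.5, 0.5), (10.5, 10.5), (2.0, 2.0), (9.0, 9.0)]
labels = weighted_knn_predict(anchor_xy, anchor_labels, data_xy, k=3)
# labels == [0, 1, 0, 1]
```

Because the classifier uses only geographical coordinates as covariates, nearby data locations tend to receive the same label, which is consistent with the goal of spatially contiguous clusters.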
