Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets

Identifying geographic ecoregions, areas of similar ecological and environmental conditions, has long been of interest to environmental scientists and ecologists. Such classifications are important for predicting suitable species ranges, stratifying ecological samples, and prioritizing habitat preservation and remediation efforts. Hargrove and Hoffman [1, 2] have developed geographical spatio-temporal clustering algorithms and codes and have applied them successfully to a variety of environmental science domains, including ecological regionalization; environmental monitoring network design; analysis of satellite-, airborne-, and ground-based remote sensing; and climate model-model and model-measurement intercomparison. With advances in state-of-the-art satellite remote sensing and climate models, observations and model outputs are available at increasingly high spatial and temporal resolutions. Long time series of these high-resolution datasets are extremely large and still growing. Analyzing and extracting knowledge from them is not just an algorithmic and ecological problem but also a complex computational one. This paper focuses on the development of a massively parallel multivariate geographical spatio-temporal clustering code for analysis of very large datasets using tens of thousands of processors on one of the fastest supercomputers in the world.
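As a rough illustration of how one k-means iteration can be split across processors, the sketch below uses the common data-parallel formulation: each process assigns its own block of observations to the nearest centroid and accumulates per-cluster sums and counts, and the partial results are then combined to update the centroids. This is not the code developed in this paper. The function names (`nearest`, `local_step`, `kmeans_step`, `kmeans`) are hypothetical, processes are simulated as a list of data chunks, and the reduction loop stands in for what would be an all-reduce in a message-passing implementation.

```python
from typing import List, Tuple

Point = List[float]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = j, d
    return best


def local_step(chunk: List[Point],
               centroids: List[Point]) -> Tuple[List[Point], List[int]]:
    """Work of one simulated process: per-cluster partial sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for a in range(dim):
            sums[j][a] += point[a]
    return sums, counts


def kmeans_step(chunks: List[List[Point]],
                centroids: List[Point]) -> List[Point]:
    """One iteration: run local_step per chunk, then reduce and update."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    # Assumption: this loop plays the role of a global sum across processes.
    for chunk in chunks:
        sums, counts = local_step(chunk, centroids)
        for j in range(k):
            total_counts[j] += counts[j]
            for a in range(dim):
                total_sums[j][a] += sums[j][a]
    new_centroids = []
    for j in range(k):
        if total_counts[j]:
            new_centroids.append([s / total_counts[j] for s in total_sums[j]])
        else:
            # An empty cluster keeps its previous centroid in this sketch.
            new_centroids.append(list(centroids[j]))
    return new_centroids


def kmeans(chunks: List[List[Point]], centroids: List[Point],
           max_iter: int = 100) -> List[Point]:
    """Iterate until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_step(chunks, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Because only the k centroid sums and counts are exchanged per iteration, the communication volume in this formulation depends on the number of clusters and variables rather than on the number of observations, which is one reason it is often used for very large datasets.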

[1]  W. Hargrove, et al. Potential of Multivariate Quantitative Methods for Delineation and Visualization of Ecoregions, 2004, Environmental Management.

[2]  Inderjit S. Dhillon, et al. A Data-Clustering Algorithm on Distributed Memory Multiprocessors, 1999, Large-Scale Parallel Data Mining.

[3]  William W. Hargrove, et al. Using multivariate clustering to characterize ecoregion borders, 1999, Comput. Sci. Eng.

[4]  Anil K. Jain, et al. Data clustering: a review, 1999, CSUR.

[5]  W. Hargrove, et al. Toward a national early warning system for forest disturbances using remotely sensed canopy phenology, 2009.

[6]  John A. Hartigan, et al. Clustering Algorithms, 1975.

[7]  Jitendra Kumar, et al. Geospatiotemporal data mining in an early warning system for forest threats in the United States, 2010, 2010 IEEE International Geoscience and Remote Sensing Symposium.

[8]  Vaidy S. Sunderam, et al. PVM: A Framework for Parallel Distributed Computing, 1990, Concurr. Pract. Exp.

[9]  Steven J. Phillips. Acceleration of K-Means and Related Clustering Algorithms, 2002, ALENEX.

[10]  William Gropp, et al. Using MPI: Portable Parallel Programming with the Message-Passing Interface, 1994.

[11]  Soon Myoung Chung, et al. Parallel bisecting k-means with prediction clustering algorithm, 2006, The Journal of Supercomputing.

[12]  Anil K. Jain, et al. Large-Scale Parallel Data Clustering, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[13]  William W. Hargrove, et al. Multivariate Geographic Cluster Using a Beowulf-style Parallel Computer, 1999, PDPTA.

[14]  Steven Phillips, et al. Reducing the computation time of the Isodata and K-means unsupervised classification algorithms, 2002, IEEE International Geoscience and Remote Sensing Symposium.