PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data

Remote sensing big data (RSBD) is generally characterized by huge volumes, diversity, and high dimensionality. Mining hidden information from RSBD for different applications imposes significant computational challenges. Clustering is an important data mining technique widely used in processing and analyzing remote sensing imagery. However, conventional clustering algorithms are designed for relatively small datasets. When applied to RSBD problems, they are generally too slow or inefficient for practical use. In this paper, we propose a parallel subsampling-based clustering (PARSUC) method for improving the performance of RSBD clustering in terms of both efficiency and accuracy. PARSUC leverages a novel subsampling-based data partitioning (SubDP) method to realize three-step parallel clustering, which addresses a notable performance bottleneck of existing parallel clustering algorithms: they must perform numerous repeated calculations to reach a reasonable result. Furthermore, we propose a centroid filtering algorithm (CFA) to eliminate subsampling errors and to guarantee the accuracy of the clustering results. PARSUC was implemented on a Hadoop platform using the MapReduce parallel model. Experiments conducted on massive remote sensing images of different sizes showed that PARSUC (1) provided much better accuracy than conventional remote sensing clustering algorithms in handling larger image data; (2) scaled well as computing nodes were added; and (3) took much less time than the existing parallel clustering algorithm in handling RSBD.
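The abstract names the ingredients (subsampling-based partitioning, parallel clustering of the subsamples, centroid filtering) without giving their details, so the following is only a minimal single-machine sketch of that general idea, not the PARSUC implementation. Everything specific in it is an assumption: plain k-means as the per-subsample clusterer, farthest-first seeding, a support-based filter standing in for CFA, and the hypothetical names `subsample_partition`, `filter_centroids`, `subsample_cluster`, `radius`, and `min_support`. The per-subsample loop is where a MapReduce map phase would run in parallel; here it runs sequentially.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means with farthest-first seeding; returns k centroids."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def subsample_partition(points, n_parts, seed=0):
    """Assumed step 1: shuffle once, then deal the points into n_parts
    disjoint random subsamples."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def filter_centroids(candidates, radius, min_support):
    """Hypothetical stand-in for centroid filtering: keep a candidate only if
    at least min_support other candidates lie within `radius` of it."""
    kept = []
    for i, c in enumerate(candidates):
        support = sum(
            1
            for j, other in enumerate(candidates)
            if j != i and math.dist(c, other) <= radius
        )
        if support >= min_support:
            kept.append(c)
    return kept


def subsample_cluster(points, k, n_parts, radius, min_support):
    """Partition, cluster each subsample, then filter and merge centroids."""
    parts = subsample_partition(points, n_parts)
    # Assumed step 2: cluster every subsample independently (parallelizable).
    candidates = [c for part in parts for c in kmeans(part, k)]
    # Assumed step 3: drop unsupported candidates, then merge the rest.
    kept = filter_centroids(candidates, radius, min_support)
    return kmeans(kept, k)
```

Calling `subsample_cluster(data, k=3, n_parts=4, radius=3.0, min_support=2)` on a list of 2-D tuples returns three merged centroids. The filter reflects one possible design choice: a centroid produced by a single unlucky subsample has no nearby counterparts from the other subsamples, so it is discarded before the merge.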
