Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGO in Hadoop

Abstract Clustering algorithms are an important branch of data mining and have been widely applied in IoT applications such as finding similar sensing patterns, detecting outliers, and segmenting large behavioral groups in real time. Traditional full-batch k-means clustering of IoT big data faces large-scale storage demands and high computational complexity. To overcome the latency inherent in full-batch k-means, two big data processing methods are commonly used. The first feeds small batches of the input data to multiple computers to reduce the computational effort. However, because the sensed data may be heterogeneously fused from different sources in an IoT network, the size of each mini-batch may vary from one iteration of the clustering process to the next. When such input data are clustered, their centers can shift drastically, which affects the final clustering results. The second method is parallel computing, which decreases the runtime while leaving the overall computational effort unchanged. Furthermore, centroid-based clustering algorithms such as k-means converge easily to local optima. In light of this, this paper proposes a new partitioned clustering method, optimized by a metaheuristic, for IoT big data environments. The method has three main steps. First, a sample of the dataset is partitioned into mini-batches. Second, the centroids of the mini-batches are adjusted. Third, the mini-batches are collated to form clusters so that cluster quality is maximized. How the centroid positions are tuned within the mini-batches is governed by a metaheuristic called Dynamic Group Optimization. The data are processed in parallel in Hadoop. Extensive experiments are conducted to investigate the performance. The results show that the proposed method is a promising tool for clustering fused IoT data efficiently.
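To make the mini-batch idea concrete, the following is a minimal sketch of plain mini-batch k-means in Python with NumPy. It is not the method proposed in the paper: the centroid adjustment here uses a simple per-center running-average update rather than Dynamic Group Optimization, and it runs on a single machine rather than in Hadoop. The function names `mini_batch_kmeans` and `assign`, and all parameter defaults, are assumptions made for illustration.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=64, n_iters=100, seed=0):
    """Illustrative mini-batch k-means (running-average centroid update).

    Assumption: this stands in for the paper's centroid adjustment step,
    which is driven by Dynamic Group Optimization and is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of points each centroid has absorbed
    for _ in range(n_iters):
        # Draw one mini-batch from the dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Squared Euclidean distance from every batch point to every centroid.
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        # Move each centroid toward its assigned points with a step size
        # that shrinks as the centroid absorbs more points.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers

def assign(X, centers):
    """Label every point with the index of its nearest centroid."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

A call such as `centers = mini_batch_kmeans(X, k=3)` followed by `labels = assign(X, centers)` yields one label per row of `X`. Because the centroids are seeded from random data points and updated greedily, this sketch can settle in a local optimum, which is the weakness the abstract says the metaheuristic is meant to address.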
