Big data fuzzy C-means algorithm based on bee colony optimization using an Apache Hbase

Analyses of clustering algorithms, including their time and space complexity, have long been discussed in the literature, and the emergence of big data has created many new challenges in this area. Because of their high complexity and long execution times, traditional clustering techniques cannot be applied to data at this scale. This research addresses that problem. The proposed clustering algorithm is based on a bee colony algorithm and uses a MapReduce architecture to obtain high-speed read/write performance. This architecture allows the proposed method to cluster data of any volume. The presented algorithm achieves good performance and high precision. Simulation results on three datasets show that it is more efficient than other big data clustering methods, and its execution time on huge datasets is much better than that of other big data clustering approaches.
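
To make the idea of running fuzzy C-means on a map/reduce architecture more concrete, the following is a minimal single-machine sketch of one iteration of the standard fuzzy C-means update, written as a map step and a reduce step. It is not the authors' implementation: the bee colony optimization and the HBase storage layer are omitted, and the names `memberships`, `map_point`, `reduce_cluster`, `fcm_iteration`, and the fuzzifier default `m=2.0` are assumptions chosen for illustration.

```python
from collections import defaultdict
import math


def memberships(point, centers, m=2.0):
    """Standard fuzzy C-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    for j, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if k == j else 0.0 for k in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


def map_point(point, centers, m=2.0):
    """Map step (hypothetical): emit (cluster index, (weighted point, weight))."""
    for j, u in enumerate(memberships(point, centers, m)):
        w = u ** m
        yield j, ([w * x for x in point], w)


def reduce_cluster(old_center, values):
    """Reduce step (hypothetical): combine partial sums into a new center."""
    total_weight = sum(w for _, w in values)
    if total_weight == 0.0:
        # No point carries weight for this cluster; keep the old center.
        return list(old_center)
    dim = len(old_center)
    return [sum(vec[i] for vec, _ in values) / total_weight for i in range(dim)]


def fcm_iteration(points, centers, m=2.0):
    """One fuzzy C-means iteration: map every point, group by key, reduce."""
    grouped = defaultdict(list)
    for p in points:
        for j, value in map_point(p, centers, m):
            grouped[j].append(value)
    return [reduce_cluster(c, grouped[j]) for j, c in enumerate(centers)]


# Usage: two well-separated groups of 2-D points (illustrative data only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = [(2.0, 2.0), (8.0, 8.0)]
for _ in range(20):
    centers = fcm_iteration(points, centers)
```

Because each map call depends only on one point and the current centers, and each reduce call only sums per-cluster contributions, the work can in principle be spread across many workers; a distributed version would replace the in-memory `grouped` dictionary with the framework's shuffle phase.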
