A multidisciplinary ensemble algorithm for clustering heterogeneous datasets

Clustering is a commonly used method for exploring and analysing data, where the primary objective is to group similar observations into clusters. In recent decades, several algorithms and methods have been developed for cluster analysis. Most of these techniques define a cluster deterministically, based on attribute values, distance, and density, and they target homogeneous, single-featured datasets. However, these definitions do not succeed in giving the resulting clusters a clear semantic meaning. Evolutionary operators and statistical and multidisciplinary techniques may help in generating meaningful clusters. Based on this premise, we propose a new evolutionary clustering algorithm (ECA*) based on social class ranking and meta-heuristic algorithms for stochastically analysing heterogeneous and multifeatured datasets. The ECA* integrates recombinational evolutionary operators, Levy flight optimisation, and statistical techniques such as quartiles and percentiles, as well as the Euclidean distance of the K-means algorithm. Experiments are conducted to evaluate the ECA* against five conventional approaches: K-means (KM), K-means++ (KM++), expectation maximisation (EM), learning vector quantisation (LVQ), and the genetic algorithm for clustering++ (GENCLUST++). To that end, 32 heterogeneous and multifeatured datasets are used to examine their performance using internal, external, and basic statistical clustering performance measures, and to measure how sensitive their performance is to five features of these datasets (cluster overlap, the number of clusters, cluster dimensionality, the cluster structure, and the cluster shape) in the form of an operational framework. The results indicate that the ECA* surpasses its counterpart techniques in its ability to find the right clusters.
Significantly, compared to its counterpart techniques, the ECA* is less sensitive to the five dataset properties mentioned above. The overall performance order of these algorithms, from best to worst, is the ECA*, EM, KM++, KM, LVQ, and GENCLUST++. The overall performance rank of the ECA* is 1.1 (where a rank of 1 denotes the best-performing algorithm and a rank of 6 the worst-performing one) across the 32 datasets, based on the five dataset features mentioned above.
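
The abstract names three generic building blocks: quartiles, Levy flight steps, and Euclidean distance to centroids. The following is a minimal Python sketch of those building blocks only. It is not the ECA* itself, whose procedure is not specified here. The function names, the choice of Mantegna's method for drawing Levy steps, and the parameter `beta=1.5` are illustrative assumptions.

```python
import math
import random
import statistics


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length using Mantegna's method (assumed here)."""
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)  # numerator: normal with scale sigma_u
    v = rng.gauss(0.0, 1.0)      # denominator: standard normal
    return u / abs(v) ** (1 / beta)


def quartiles(values):
    """Return (Q1, median, Q3) of a one-dimensional sample."""
    return tuple(statistics.quantiles(values, n=4))


if __name__ == "__main__":
    centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical centroids
    print(nearest_centroid((8.0, 9.0), centroids))
    print(quartiles([1, 2, 3, 4, 5, 6, 7]))
    # Hypothetical use: perturb each coordinate of a centroid by a small Levy step.
    rng = random.Random(0)
    print([c + 0.01 * levy_step(rng=rng) for c in centroids[0]])
```

The heavy tail of the Levy distribution means that most perturbations are small while occasional ones are large, which is the usual motivation for using such steps in meta-heuristic search.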
