Paper information - The implementation of k-means partitioning algorithm in HOPACH clustering method

Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH) is a clustering method that combines the strengths of partitioning and agglomerative clustering. The partitioning step can use any of several partitioning algorithms, such as PAM, K-Means, or SOM. Partitioning is followed by an ordering step and then by an agglomerative process. The number of main clusters is determined by the Mean Split Silhouette (MSS), which measures the heterogeneity of a clustering result: the lower the MSS, the more homogeneous the members of each cluster. We therefore select the number of clusters from the clustering result with the minimum MSS. In this implementation, we incorporate the K-Means partitioning algorithm into HOPACH to cluster and analyze 136 DNA sequences of Ebola viruses. The DNA sequences are first collected from GenBank, and features are then extracted from them using N-mer frequencies. The extracted features are compiled into a feature matrix and normalized to the interval [0, 1] with min-max normalization; the normalized matrix is the input for generating a genetic distance matrix using Euclidean distance. This genetic distance matrix is used in the partitioning step by the K-Means algorithm within HOPACH. As a result, we obtained 8 clusters with a minimum MSS of 0.50266. The clustering in this article was carried out with the open-source program R.
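The preprocessing steps described in the abstract (N-mer frequency extraction, min-max normalization to [0, 1], and a Euclidean distance matrix) can be illustrated with a minimal sketch. This is not the authors' code: the article uses R, while this sketch uses plain Python for illustration. The function names, the choice of overlapping N-mer counts over the alphabet ACGT, the handling of constant columns, and the toy sequences are all assumptions made here.

```python
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers over the alphabet ACGT, in a fixed order (assumed scheme)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing other symbols, such as N
            counts[window] += 1
    return [counts[m] for m in kmers]


def min_max_normalize(matrix):
    """Rescale every column (feature) of the matrix to the interval [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # Assumption: a constant column carries no information and is mapped to 0.0.
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


def euclidean_distance_matrix(matrix):
    """Pairwise Euclidean distances between the rows of the matrix."""
    return [
        [sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2))) for r2 in matrix]
        for r1 in matrix
    ]


# Toy sequences invented for illustration only (not Ebola virus data).
sequences = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
features = [kmer_counts(s, 2) for s in sequences]  # one row per sequence
normalized = min_max_normalize(features)           # every feature in [0, 1]
distances = euclidean_distance_matrix(normalized)  # symmetric, zero diagonal
```

In a full pipeline, a matrix like `distances` (or the normalized feature matrix) would then be passed to the partitioning step. That step and the MSS computation are not reproduced here.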

Alhadi Bustamam | Dipo Aldila | K. R. Adzima
