An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data

BACKGROUND Clustering algorithms that include random steps usually give different results on different runs over the same dataset. This non-determinism, found in algorithms such as K-Means, limits their applicability in areas such as cancer subtype prediction from gene expression data, because it is hard to compare their results sensibly with those of other algorithms. In K-Means, the non-determinism comes from the random selection of data points as initial centroids.

METHOD We propose an improved, density-based version of K-Means with a novel, systematic method for selecting initial centroids. The key idea is to select as initial centroids data points that belong to dense regions and are adequately separated in feature space.

RESULTS We compared the proposed algorithm with eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm used for cancer data classification, based on their performance on ten cancer gene expression datasets. The proposed algorithm showed better overall performance than the others.

CONCLUSION There is a pressing need in the biomedical domain for simple, easy-to-use, and more accurate machine learning tools for cancer subtype prediction. The proposed algorithm is simple and easy to use, and it gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data.
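The initialization idea described under METHOD (initial centroids drawn from dense regions and kept well separated) can be illustrated with a minimal sketch. This is not the authors' exact procedure, which the abstract does not spell out. The density estimate (inverse mean distance to the `n_neighbors` nearest points), the density-times-separation score, and the function names `density_seeds` and `kmeans` are all assumptions made for illustration. Because no step uses random numbers, repeated runs on the same data give the same clusters.

```python
import numpy as np

def density_seeds(X, k, n_neighbors=5):
    """Deterministically pick k initial centroids (illustrative heuristic only)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if not 1 <= k <= n or n < 2:
        raise ValueError("need at least 2 points and 1 <= k <= number of points")
    # Pairwise Euclidean distances via the squared-norm expansion.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)          # guard against tiny negative round-off
    dist = np.sqrt(d2)
    np.fill_diagonal(dist, 0.0)
    # Assumed density proxy: inverse mean distance to the nearest neighbours
    # (column 0 after sorting is the point itself, so it is skipped).
    m = min(n_neighbors, n - 1)
    knn = np.sort(dist, axis=1)[:, 1:m + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-12)
    # First seed: the densest point. Each later seed maximises
    # density * distance-to-nearest-chosen-seed (dense AND well separated).
    chosen = [int(np.argmax(density))]
    while len(chosen) < k:
        min_dist = dist[:, chosen].min(axis=1)
        chosen.append(int(np.argmax(density * min_dist)))
    return X[chosen].copy()

def kmeans(X, centroids, max_iter=100):
    """Standard Lloyd iterations started from the given centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(centroids, dtype=float)
    for _ in range(max_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Usage on a tiny made-up dataset (rows = samples, columns = features).
X_demo = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
demo_labels, demo_centroids = kmeans(X_demo, density_seeds(X_demo, k=2, n_neighbors=2))
```

Ties in `np.argmax` are resolved by taking the lowest index, so the result depends only on the data and its row order. The full pairwise distance matrix costs memory quadratic in the number of samples, which is usually acceptable for gene expression cohorts with at most a few thousand samples.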
