Clustering performance using k-modes with modified entropy measure for breast cancer

Breast cancer is a serious disease whose diagnosis and treatment depend on data analysis. Clustering is a data mining technique often used in breast cancer research to assess the level of malignancy at an early stage. However, clustering categorical data is challenging because the different levels of categorical variables can influence the clustering process. This research proposes a modified entropy measure (MEM) to enhance clustering performance. MEM aims to address the limitations of distance-based measures when clustering categorical data. It is also useful for assessing information loss in categorical clustering: by quantifying the information lost during clustering, it helps reveal patterns and relationships in the data. An evaluation compares k-modes+MEM, k-means+MEM, DBSCAN+MEM, and affinity+MEM with conventional clustering algorithms. Clustering accuracy, intra-cluster distance, and the Fowlkes-Mallows index (FMI) are used to evaluate algorithm performance. Experimental results show significant improvements: the k-modes+MEM algorithm achieves a reduction in average intra-cluster distance and outperforms the other algorithms in accuracy, intra-cluster distance, and FMI. The proposed algorithm can be extended to heterogeneous datasets in domains such as healthcare, finance, and marketing.
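The abstract does not give the formula for MEM, so the following is only a minimal sketch of the two generic ingredients it mentions: an entropy-based score for a clustering of categorical records, and the Fowlkes-Mallows index. It assumes plain Shannon entropy (not the authors' modified measure) and the standard pair-counting definition of FMI. The function names and the toy attribute values are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def shannon_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def expected_within_cluster_entropy(records, labels):
    """Size-weighted sum of per-attribute entropies inside each cluster.

    Lower values mean the clusters are more homogeneous. This is a generic
    entropy criterion, NOT the paper's MEM (assumption).
    """
    n = len(records)
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    total = 0.0
    for members in clusters.values():
        weight = len(members) / n
        # zip(*members) iterates over attributes (columns) of the cluster
        total += weight * sum(shannon_entropy(col) for col in zip(*members))
    return total


def fowlkes_mallows(true_labels, pred_labels):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)), counted over record pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Hypothetical toy records with two categorical attributes each
records = [("premeno", "left"), ("premeno", "left"),
           ("ge40", "right"), ("ge40", "right")]
good = [0, 0, 1, 1]   # groups identical records together
bad = [0, 1, 0, 1]    # mixes the two kinds of record

print(expected_within_cluster_entropy(records, good))
print(expected_within_cluster_entropy(records, bad))
print(fowlkes_mallows(good, good), fowlkes_mallows(good, bad))
```

On this toy data the homogeneous assignment scores lower entropy than the mixed one, which is the sense in which an entropy measure can rank categorical clusterings without relying on a distance between category values.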
