A Hybrid Supervised Machine Learning Classifier System for Breast Cancer Prognosis Using Feature Selection and Data Imbalance Handling Approaches

Breast cancer is the most frequent cancer among women. Early detection is critical, and machine learning (ML) techniques can support it effectively. This article investigates methods to improve the accuracy of ML classification models for the prognosis of breast cancer. A wrapper-based feature selection approach is combined with search algorithms, namely the nature-inspired Particle Swarm Optimization and Genetic Search as well as Greedy Stepwise, to identify the important features. Three popular ML classifiers are then trained on the selected features: Support Vector Machine, J48 (the C4.5 decision tree algorithm), and Multilayer Perceptron (a feed-forward ANN). The methodology of the proposed system is structured into five stages: (1) data pre-processing; (2) data imbalance handling; (3) feature selection; (4) machine learning classifiers; and (5) evaluation of classifier performance. The experiments use the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. The results indicate that the J48 decision tree classifier is the most appropriate of the ML classifiers evaluated for breast cancer prognosis. Support Vector Machine with the Particle Swarm Optimization algorithm for feature selection achieves an accuracy of 98.24%, MCC = 0.961, sensitivity = 99.11%, specificity = 96.54%, and a Kappa statistic of 0.9606. The J48 decision tree classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.83%, MCC = 0.974, sensitivity = 98.95%, specificity = 98.58%, and a Kappa statistic of 0.9735. Finally, the Multilayer Perceptron ANN classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.59%, MCC = 0.968, sensitivity = 98.6%, specificity = 98.57%, and a Kappa statistic of 0.9682.
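The five evaluation measures reported above (accuracy, MCC, sensitivity, specificity, and the Kappa statistic) can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard formulas, not the authors' evaluation code. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the paper's confusion matrices, and which class is treated as positive is an assumption left to the caller.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification measures from confusion-matrix counts.

    tp/fn/fp/tn are true positives, false negatives, false positives and
    true negatives. Which class counts as "positive" is the caller's choice.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate

    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not taken from the article).
example = binary_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

MCC and Kappa are useful alongside accuracy here because both account for the class distribution, which matters when the benign and malignant classes are not equally represented.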
