Breast Cancer Prediction using Feature Selection and Ensemble Voting

Breast cancer is the most common cancer among women worldwide. This paper analyses the performance of supervised and unsupervised models for breast cancer classification, using data from the Wisconsin Breast Cancer Dataset. Feature selection is performed through scaling and principal component analysis. The final results indicate that the Ensemble Voting approach is ideal as a predictive model for breast cancer. The raw data contains 569 cases. The data is split into training and testing sets in the ratio 70:30, respectively. A benchmark model is then created using the Random Forest method. Various models are trained and tested on the data after Feature Scaling and Principal Component Analysis. Cross-validation is performed, which shows that our model is stable. Among all the evaluated models, only four, i.e., Ensemble Voting Classifier, Logistic Regression, SVM Tuning and AdaBoost, achieved an accuracy of at least 98%. Based on the precision, recall, ROC-AUC, F1-measure and computational time of the models, the Ensemble Voting Classifier showed the most potential for breast cancer classification on the given dataset.
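As a rough illustration of the voting idea, the sketch below combines the label predictions of three classifiers by simple majority (hard) voting and then computes precision, recall and the F1-measure. The abstract does not say whether hard or soft voting was used, so hard voting is an assumption here. The function names `hard_vote` and `precision_recall_f1`, the three placeholder models and the toy labels are all hypothetical and are not results from the paper.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote."""
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for sample_votes in zip(*predictions_per_model):
        # Take the label that received the most votes for this sample.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up toy labels (1 = malignant, 0 = benign), for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]  # hypothetical classifier, wrong on sample 4
model_b = [1, 0, 0, 1, 0, 1]  # hypothetical classifier, wrong on samples 3 and 6
model_c = [1, 1, 1, 1, 0, 0]  # hypothetical classifier, wrong on sample 2

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
print(precision_recall_f1(y_true, ensemble))
```

In this toy case each individual model misclassifies at least one sample, but no two models are wrong on the same sample, so the majority vote recovers every label. That is the intuition behind voting ensembles; it does not guarantee an improvement when the base models make correlated errors.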
