Cost-Sensitive Extreme Gradient Boosting for Imbalanced Classification of Breast Cancer Diagnosis

Clinical information can help doctors predict and diagnose diseases and make the right decisions. Breast cancer is one of the most dangerous diseases; early diagnosis can improve the chance of survival and support clinical treatment. Detecting breast cancer is time-consuming, and the cases are hard to classify. The classification problem becomes harder when the classes in the dataset are unequally distributed, because such imbalance degrades the performance of traditional machine learning models. For this reason, this work proposes a cost-sensitive XGBoost model, which improves on the XGBoost model by combining it with cost-sensitive learning. The models were applied to classify four breast cancer datasets containing imbalanced data. In the experiment, the best parameters for each dataset were determined by hyperparameter optimization before the models were configured. The results indicated that the cost-sensitive XGBoost model was skillful and improved classification accuracy on all four datasets. In addition, model performance was evaluated by accuracy, ROC AUC, and k-fold cross-validation to ensure that the new model is accurate.
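The abstract does not spell out how the misclassification costs enter the model. As a minimal sketch of one common way to make a gradient-boosted classifier cost-sensitive (an assumption, not necessarily the authors' formulation), the positive class can be given a larger weight in the logistic loss, which scales the gradient and Hessian that the boosting procedure fits at each round. The function names `imbalance_ratio` and `weighted_logloss_grad_hess` and the toy labels are hypothetical and for illustration only.

```python
import math


def imbalance_ratio(labels):
    """Return count(negative) / count(positive).

    This ratio is a common heuristic for the positive-class weight
    (assumption: the paper may tune the weight differently).
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples")
    return neg / pos


def weighted_logloss_grad_hess(margin, label, pos_weight):
    """Gradient and Hessian of the class-weighted logistic loss.

    Loss: -w * [y * log(p) + (1 - y) * log(1 - p)], with p = sigmoid(margin)
    and w = pos_weight for positives, 1 for negatives.
    Derivatives are taken with respect to the raw margin.
    """
    p = 1.0 / (1.0 + math.exp(-margin))
    w = pos_weight if label == 1 else 1.0
    grad = w * (p - label)
    hess = w * p * (1.0 - p)
    return grad, hess


# Toy imbalanced label set: 8 negatives, 2 positives (hypothetical data).
labels = [0] * 8 + [1] * 2
pos_weight = imbalance_ratio(labels)

# At margin 0 the predicted probability is 0.5 for both examples, but the
# positive example contributes a larger gradient and Hessian.
g_pos, h_pos = weighted_logloss_grad_hess(0.0, 1, pos_weight)
g_neg, h_neg = weighted_logloss_grad_hess(0.0, 0, pos_weight)
print(pos_weight, g_pos, h_pos, g_neg, h_neg)
```

In the XGBoost library, a comparable effect for binary classification is available through the `scale_pos_weight` parameter, which is typically treated as one more hyperparameter to tune alongside the others.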
