Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy

Highlights:
Automatic clinical decision support system for breast cancer malignancy grading.
Different methodologies for segmentation and feature extraction from FNA slides.
An efficient classifier ensemble for imbalanced problems with difficult data.
Ensemble combines boosting with evolutionary undersampling.
Extensive computational experiments on a large database collected by the authors.

In this paper, we propose a complete, fully automatic and efficient clinical decision support system for breast cancer malignancy grading. Estimating the malignancy level of a cancer is important for assessing how far it has progressed and for designing a personalized therapy. Our system uses both image processing and machine learning techniques to analyze biopsy slides. Three image segmentation methods (fuzzy c-means color segmentation, the level set active contours technique and a grey-level quantization method) are considered for extracting the features used by the proposed classification system. In this classification problem, the highest malignancy grade is the most important one to detect early, even though it occurs in the fewest cases; malignancy grading is therefore an imbalanced classification problem. To overcome this difficulty, we propose the use of an efficient ensemble classifier named EUSBoost, which combines a boosting scheme with evolutionary undersampling to produce a balanced training set for each base classifier in the final ensemble. The evolutionary approach allows us to select the most significant samples for the classifier learning step (in terms of accuracy and a new diversity term included in the fitness function), thus alleviating the problems caused by class imbalance in a guided and effective way. Experiments carried out on a large dataset collected by the authors confirm the high efficiency of the proposed system, show that the level set active contours technique extracts the features with the highest discriminative power, and demonstrate that EUSBoost is able to outperform state-of-the-art ensemble classifiers in a real-life imbalanced medical problem.
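The idea of selecting majority-class samples with a fitness that rewards both accuracy and diversity can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several simplifying assumptions: a mutation-only search stands in for the evolutionary algorithm, a leave-one-out 1-NN classifier on one-dimensional features supplies the accuracy term (geometric mean of per-class recall), and normalized Hamming distance to earlier masks stands in for the diversity term. The boosting weight update is omitted. All function names and the weights 0.2 are hypothetical.

```python
import math
import random


def one_nn_predict(train, query):
    """Label of the nearest training point, excluding the query itself."""
    candidates = (p for p in train if p is not query)
    return min(candidates, key=lambda p: abs(p[0] - query[0]))[1]


def g_mean(train, data):
    """Geometric mean of per-class recall of leave-one-out 1-NN on data."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for point in data:
        label = point[1]
        totals[label] += 1
        hits[label] += one_nn_predict(train, point) == label
    return math.sqrt(hits[0] / totals[0] * hits[1] / totals[1])


def hamming(mask_a, mask_b):
    """Number of positions where two selection masks differ."""
    return sum(a != b for a, b in zip(mask_a, mask_b))


def fitness(mask, majority, minority, previous_masks, diversity_weight=0.2):
    """Accuracy minus an imbalance penalty plus a diversity bonus (assumed form)."""
    selected = [p for p, keep in zip(majority, mask) if keep]
    if not selected:
        return float("-inf")  # an empty majority subset is never acceptable
    score = g_mean(selected + minority, majority + minority)
    penalty = 0.2 * abs(1 - len(minority) / len(selected))
    diversity = 0.0
    if previous_masks:
        diversity = min(hamming(mask, m) for m in previous_masks) / len(mask)
    return score - penalty + diversity_weight * diversity


def evolve_mask(majority, minority, previous_masks, rng, generations=200):
    """Mutation-only search for a majority-class selection mask."""
    n = len(majority)
    best = [rng.random() < len(minority) / n for _ in range(n)]
    if not any(best):
        best[rng.randrange(n)] = True  # guarantee a non-empty starting subset
    best_fit = fitness(best, majority, minority, previous_masks)
    for _ in range(generations):
        # Flip each gene with probability 1/n.
        child = [(not g) if rng.random() < 1.0 / n else g for g in best]
        child_fit = fitness(child, majority, minority, previous_masks)
        if child_fit >= best_fit:
            best, best_fit = child, child_fit
    return best


def undersample_per_round(majority, minority, rounds=3, seed=0):
    """One selection mask per boosting round, each pushed away from earlier ones."""
    rng = random.Random(seed)
    masks = []
    for _ in range(rounds):
        masks.append(evolve_mask(majority, minority, masks, rng))
    return masks
```

In a full ensemble, each mask would define the training set (all minority samples plus the selected majority samples) for one base classifier, and the boosting step would then reweight the samples before the next round.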
