Medical decision support system for extremely imbalanced datasets

Abstract: Advanced biomedical instruments and data acquisition techniques generate large amounts of physiological data. Accurate diagnosis of the related pathologies requires new methods for analyzing and understanding these data. Clinical decision support systems are designed to provide real-time guidance to healthcare experts, and they are evolving as an alternative strategy for increasing the accuracy of diagnostic testing. The generalization ability of such a system is governed by the characteristics of the dataset used during its development. Sub-pathologies occur in the population at widely varying rates, which makes the dataset extremely imbalanced. This problem can be addressed at two levels: the data level and the algorithmic level. This work proposes a synthetic sampling technique to balance the dataset, together with a Modified Particle Swarm Optimization (M-PSO) technique. It presents a comparative study of multiclass support vector machine (SVM) classifier optimization algorithms based on grid selection (GSVM), hybrid feature selection (SVMFS), genetic algorithm (GA) and M-PSO. Empirical analysis of five machine learning algorithms demonstrates that M-PSO statistically outperforms the others.
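
As an illustration of what balancing at the data level can look like, the following minimal sketch oversamples every minority class by interpolating between a sample and one of its nearest same-class neighbours (a generic SMOTE-style scheme). It is not the sampling technique proposed in this work, whose details the abstract does not give; the function names `oversample_minority` and `balance`, the neighbour count `k`, and the choice to raise every class to the size of the largest one are assumptions made only for this example.

```python
import random


def oversample_minority(samples, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen sample and one of its k nearest neighbours from the same class.
    Generic SMOTE-style sketch, not the paper's method."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # Rank the other samples by squared Euclidean distance to base.
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, seed=0):
    """Oversample each class up to the size of the largest class
    (an assumed balancing target for this sketch)."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(list(row))
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = []
        if len(rows) < target:
            extra = oversample_minority(rows, target - len(rows), seed=seed)
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out
```

Calling `balance(X, y)` on a feature list `X` and label list `y` returns a dataset in which every class has as many samples as the largest original class. Because synthetic points are convex combinations of two existing samples, they stay inside the region already spanned by their class; a class with a single sample cannot be interpolated and raises an error in this sketch.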
