The study of under- and over-sampling methods' utility in analysis of highly imbalanced data on osteoporosis

Abstract Osteoporosis is a common bone disease that typically lacks early symptoms but leads to serious complications such as low-energy bone fractures. Patients with risk factors should therefore be screened so that a diagnosis can be made as early as possible. Unfortunately, the registered medical data are often highly imbalanced, which makes machine-based data processing difficult or even impossible. Our goal was therefore to find the best method of coping with class imbalance in the analysed data on osteoporotic patients. To this end, we evaluated several classifier paradigms in combination with preprocessing techniques that address the skewed class distribution of the data. In the source dataset, 92.6% of instances corresponded to patients without any fractures (negative cases) and only 7.41% to patients who reported at least one fracture (positive cases). To alleviate the class imbalance, we examined not only data-level methods, which modify the input dataset, but also ensemble methods, which strengthen the results of the base algorithms. In the first group, under- and over-sampling methods were used, such as random undersampling, edited nearest neighbours and the synthetic minority over-sampling technique (SMOTE), while in the second group a range of methods based on various subsets of the training data were analysed. Combinations of the above were also investigated. Additionally, we propose a way to determine a balancing level that yields adequate classification performance without excessively distorting the raw input data. The aim of our experiment was to identify whether an undersampling or an oversampling approach, applied with simple and ensemble-based classifiers, achieves the best results. The comparative study on our dataset showed that the highest efficiency was achieved with the synthetic minority over-sampling technique and the RandomForest classifier. As for the optimal balancing level, we determined empirically that 300% oversampling with SMOTE combined with edited nearest neighbours undersampling gave the required classification precision.
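The best-performing configuration described above can be reproduced in outline with standard open-source tooling. The following sketch uses the imbalanced-learn and scikit-learn libraries to combine SMOTE oversampling of the minority (fracture) class with edited nearest neighbours cleaning before training a RandomForest classifier. The synthetic placeholder data, the feature matrix X, the label vector y and the translation of the 300% oversampling level into a sampling ratio are illustrative assumptions, not the authors' implementation.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.pipeline import Pipeline  # sampler-aware pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the osteoporosis dataset:
# y = 1 means the patient reported at least one fracture (~7.4% of cases).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.074).astype(int)

# "300% oversampling" in the original SMOTE sense grows the minority class to
# roughly four times its size; with ~7.4% positives that corresponds to a
# minority-to-majority ratio of about 4 * 0.074 / 0.926, i.e. ~0.32.
smote = SMOTE(sampling_strategy=0.32, random_state=0)
enn = EditedNearestNeighbours(n_neighbors=3)  # removes noisy/borderline samples

model = Pipeline([
    ("smote", smote),
    ("enn", enn),
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Mean ROC AUC: {scores.mean():.3f}")

Because the samplers sit inside an imbalanced-learn Pipeline, resampling is applied only to the training folds during cross-validation, so the evaluation folds retain the original 92.6%/7.41% class distribution.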
