Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine

Data skewness is a common challenge when applying supervised machine learning in domains such as healthcare and biomedical information engineering. Evidence Based Medicine (EBM) is a clinical strategy for prescribing treatment to individual patients based on the current best evidence. To find that evidence and support their decision-making, clinicians need to query publication repositories. The information they need is expressed as scientific artefacts in scholarly publications, and extracting these artefacts automatically is a technical challenge for current generic search engines. Many classification approaches have been proposed for identifying key scientific artefacts in EBM; however, their performance is affected by the imbalanced nature of the data in this domain. In this paper, we present four data balancing approaches applied in a binary ensemble classifier framework for classifying scientific artefacts in the EBM domain. Our balancing approaches improve the ensemble classifier’s F-score by up to 15% for classes of scientific artefacts with extremely low coverage in the domain. In addition, we propose a classifier selection method that chooses the best classifier based on the distributional features of the classes. The resulting classifiers show improved classification performance compared to state-of-the-art approaches.
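To make the idea of data balancing for a binary classifier concrete, the following minimal sketch shows random under-sampling of the majority class. It is only an illustration of the general technique: the abstract does not name its four balancing approaches, so this should not be read as one of them. The function name `undersample_majority` and the toy sentences and labels are hypothetical.

```python
import random
from collections import Counter


def undersample_majority(samples, labels, seed=0):
    """Randomly drop items from the larger class until both classes match in size."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # The smallest class sets the target size for every class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for y, items in by_class.items():
        kept = items if len(items) == target else rng.sample(items, target)
        balanced.extend((x, y) for x in kept)

    rng.shuffle(balanced)
    return balanced


# Hypothetical one-vs-rest data: 1 = the artefact class of interest, 0 = all others.
sentences = [f"sentence {i}" for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

balanced = undersample_majority(sentences, labels)
counts = Counter(y for _, y in balanced)
```

In a binary ensemble setting, a step like this would be applied separately to the training set of each one-vs-rest classifier, so that a class with very low coverage is not swamped by the remaining sentences.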
