Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text

Class imbalance is a key factor that degrades the performance of many machine learning classifiers. It arises when some classes contain far more samples than others. Classifiers trained on such data often favor the majority class or classes, which results in high-precision, low-recall classifiers. Named entity recognition (NER) in free text suffers from this problem to a large extent because, in any given text, many samples do not belong to any entity. Furthermore, the data used in NER are sequential and therefore differ from the data used in other common classification tasks such as image classification, spam detection, and text classification, in which no semantic or syntactic relation exists between samples. In this study, we propose an undersampling approach for sequential data that preserves the existing correlations between the samples that make up a sentence and thus improves classifier performance. We call this method balanced undersampling (BUS). Given the recent increase in interest in NER in the chemical and biomedical domains, the proposed method is developed and tested on four recent state-of-the-art corpora from these domains: BioCreative IV ChemDNER, the Bio-entity Recognition Challenge of JNLPBA (JNLPBA), SemEval2013 DDI DrugBank, and SemEval2013 DDI Medline. The proposed method is evaluated against two other common undersampling methods: random undersampling and stop-word filtering. Our method outperforms both with respect to F-score on all four datasets.
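
To make the contrast between token-level and sentence-level undersampling concrete, the following minimal Python sketch may help. It is not the BUS algorithm, whose selection rule is not given in this abstract. It only illustrates, on a hypothetical toy corpus with assumed BIO-style tags, why dropping majority-class tokens independently destroys sentence context, whereas dropping whole entity-free sentences keeps the remaining sentences intact. All function names, tags, and the `keep_prob` parameter are assumptions made for illustration.

```python
import random

# Hypothetical toy corpus: each sentence is a list of (token, tag) pairs.
# "O" marks the majority class (tokens outside any entity).
corpus = [
    [("Aspirin", "B-CHEM"), ("inhibits", "O"), ("COX-1", "B-GENE"), (".", "O")],
    [("The", "O"), ("results", "O"), ("are", "O"), ("shown", "O"), (".", "O")],
    [("We", "O"), ("thank", "O"), ("the", "O"), ("reviewers", "O"), (".", "O")],
]


def token_level_undersample(sentences, keep_prob, rng):
    """Drop each 'O' token independently; neighbours of entities may vanish."""
    out = []
    for sent in sentences:
        kept = [(tok, tag) for tok, tag in sent
                if tag != "O" or rng.random() < keep_prob]
        if kept:
            out.append(kept)
    return out


def sentence_level_undersample(sentences, keep_prob, rng):
    """Keep sentences that contain an entity untouched; keep an entity-free
    sentence (as a whole) only with probability keep_prob."""
    out = []
    for sent in sentences:
        has_entity = any(tag != "O" for _, tag in sent)
        if has_entity or rng.random() < keep_prob:
            out.append(list(sent))
    return out


def majority_ratio(sentences):
    """Fraction of tokens tagged 'O' across all sentences."""
    tags = [tag for sent in sentences for _, tag in sent]
    return sum(t == "O" for t in tags) / len(tags) if tags else 0.0


rng = random.Random(0)
by_token = token_level_undersample(corpus, keep_prob=0.0, rng=rng)
by_sentence = sentence_level_undersample(corpus, keep_prob=0.0, rng=rng)
```

With `keep_prob=0.0`, the token-level variant leaves only the two entity tokens with no surrounding words, while the sentence-level variant retains the first sentence in full and lowers the share of `O` tokens without breaking up any sentence it keeps. Intermediate values of `keep_prob` trade class balance against the amount of entity-free text retained.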
