Clustering based active learning for biomedical Named Entity Recognition

Recognizing and extracting biomedical names is an essential task in biomedical information extraction. However, the need to prepare large annotated corpora hinders the training of Named Entity Recognition (NER) systems. Active learning reduces the manual annotation work needed in supervised learning tasks. In this work, we propose a novel clustering-based active learning method for the biomedical NER task. We show that the underlying NER system using the proposed method outperforms systems using other state-of-the-art active learning methods, including density-based, Gibbs-error-based, and entropy-based approaches, as well as random selection. We compare variations of the proposed method and find that the optimal design uses the vector representation of named entities, selects documents that are both 'representative' and 'informative', and uses the Shared Nearest Neighbor (SNN) clustering approach. In particular, the optimal variant of the proposed method achieves a deficiency gain of 36.3% over random selection.
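To make the SNN idea concrete, the following is a minimal sketch of one common form of shared-nearest-neighbor similarity: two items are considered similar in proportion to how many of their k nearest neighbors they have in common. It is an illustration only, not the implementation used in this work; the function names, the Euclidean distance, the toy 2-D vectors, and the choice of k are all assumptions made for the example.

```python
import numpy as np


def knn_indices(vectors, k):
    """For each row, return the indices of its k nearest rows (Euclidean), excluding itself."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def snn_similarity(vectors, k):
    """Symmetric matrix whose (i, j) entry counts the k-nearest neighbors shared by i and j."""
    neighbors = [set(row.tolist()) for row in knn_indices(vectors, k)]
    n = len(neighbors)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])
            sim[i, j] = sim[j, i] = shared
    return sim


# Hypothetical entity vectors: two well-separated groups of four points each.
toy_vectors = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
])
toy_sim = snn_similarity(toy_vectors, k=2)
```

In this toy setting, items from different groups share no neighbors, so their similarity is zero, while some pairs within a group share both of their neighbors. A clustering step could then group items whose shared-neighbor count exceeds a threshold; how that threshold and k are chosen is left open here.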
