A Comparative Analysis of Active Learning for Biomedical Text Mining

An enormous amount of clinical free-text information, such as pathology reports, progress reports, clinical notes and discharge summaries, has been collected at hospitals and medical care clinics. These data provide an opportunity to develop many useful machine learning (ML) applications, provided they can be transformed into a learnable structure with appropriate labels for supervised learning. The annotation has to be performed by qualified clinical experts, and the resulting high cost limits the use of these data. Active Learning (AL), an underutilised machine learning technique that can assist in labelling new data, is a promising candidate to address the high cost of labelling. AL has been successfully applied to labelling tasks in speech recognition and text classification; however, there is a lack of literature investigating its use for clinical purposes. We performed a comparative investigation of various AL techniques using ML and deep learning (DL) based strategies on three unique biomedical datasets. We investigated five AL query strategies: Random Sampling (RS), Least Confidence (LC), Informative Diversity and Density (IDD), Margin, and Maximum Representativeness-Diversity (MRD). Our experiments show that AL has the potential to significantly reduce the cost of manual labelling. Additionally, AL-assisted pre-annotation accelerates the de novo annotation process, with less annotation time required.
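
As a rough illustration of how three of the query strategies named above choose which unlabelled documents to send to an annotator, the following minimal sketch scores a pool by least confidence and by margin, with random sampling as the baseline. It uses the standard textbook definitions of these scores, not necessarily the exact formulation used in this study. The function names, the toy probability matrix and the batch size are assumptions made for illustration only; IDD and MRD are not shown.

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty = 1 - highest predicted class probability (higher = more uncertain)."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Uncertainty = negative gap between the two most probable classes."""
    ordered = np.sort(probs, axis=1)           # ascending within each row
    return -(ordered[:, -1] - ordered[:, -2])  # small gap -> high score

def random_sampling(n_pool, k, seed=0):
    """Baseline: pick k distinct pool indices uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=k, replace=False)

def select_batch(scores, k):
    """Return the indices of the k highest-scoring pool items."""
    return np.argsort(-scores, kind="stable")[:k]

# Hypothetical predicted class probabilities for four unlabelled documents.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.36, 0.33, 0.31],
])

lc_pick = select_batch(least_confidence(probs), k=2)
margin_pick = select_batch(margin(probs), k=2)
rs_pick = random_sampling(len(probs), k=2)
```

On this toy pool the two uncertainty strategies disagree: least confidence favours documents whose top class is weak overall, while margin favours documents whose top two classes are nearly tied.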
