Cost-aware active learning for named entity recognition in clinical text

OBJECTIVE Active learning (AL) attempts to reduce annotation cost (i.e., time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost of annotating each sample is identical. This study introduces a cost-aware AL method that models annotation cost and sample informativeness simultaneously, and evaluates the method through both simulation and user studies.

MATERIALS AND METHODS We designed a novel cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities. We first used lexical and syntactic features to estimate annotation cost, then incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we conducted a simulation study comparing Cost-CAUSE with non-cost-aware AL methods and a user study comparing Cost-CAUSE with passive learning.

RESULTS Our cost model fit empirical annotation data well. In simulation, Cost-CAUSE increased area under the learning curve (ALC) scores by up to 5.6% and 4.9% compared with random sampling and alternate AL methods, respectively. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%-30.2%.

DISCUSSION Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors noticeably affect AL performance, such as users' annotation accuracy, fatigue, and physical and mental condition.

CONCLUSION Cost-CAUSE saves significant annotation cost compared to random sampling.
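The abstract does not give the Cost-CAUSE scoring formula, so the following is only a minimal sketch of the general idea of cost-aware selection, not the authors' method. It assumes a hypothetical linear cost model based on token count alone (the paper uses richer lexical and syntactic features), a precomputed informativeness score per sentence, and a greedy informativeness-per-cost ranking under a time budget. All names (`Sentence`, `estimate_cost`, `select_batch`, `area_under_learning_curve`) and the coefficient values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Sentence:
    text: str
    informativeness: float  # assumed precomputed, e.g. model uncertainty


def estimate_cost(sentence: Sentence, base: float = 2.0, per_token: float = 0.5) -> float:
    """Hypothetical linear cost model: estimated annotation seconds.

    The coefficients are made up for illustration; a real model would be
    fitted to recorded annotation times.
    """
    return base + per_token * len(sentence.text.split())


def select_batch(pool: Sequence[Sentence], budget_seconds: float) -> Tuple[List[Sentence], float]:
    """Greedily pick sentences by informativeness per estimated second.

    Sentences that would exceed the remaining budget are skipped.
    """
    ranked = sorted(
        pool,
        key=lambda s: s.informativeness / estimate_cost(s),
        reverse=True,
    )
    batch: List[Sentence] = []
    spent = 0.0
    for s in ranked:
        cost = estimate_cost(s)
        if spent + cost <= budget_seconds:
            batch.append(s)
            spent += cost
    return batch, spent


def area_under_learning_curve(costs: Sequence[float], scores: Sequence[float]) -> float:
    """Trapezoidal area under a (cumulative cost, model score) learning curve."""
    area = 0.0
    for i in range(1, len(costs)):
        area += (costs[i] - costs[i - 1]) * (scores[i] + scores[i - 1]) / 2.0
    return area


if __name__ == "__main__":
    pool = [
        Sentence("chest pain", 0.9),
        Sentence("patient denies any shortness of breath or chest pain today", 1.0),
        Sentence("no fever today", 0.35),
    ]
    batch, spent = select_batch(pool, budget_seconds=7.0)
    print([s.text for s in batch], spent)
```

In this toy pool the long sentence has the highest raw informativeness, yet it is skipped because its estimated cost would exceed the remaining budget. That trade-off is what separates cost-aware selection from methods that assume every sample costs the same to annotate.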
