An Efficient Text Labeling Framework Using Active Learning Model

Electronic medical discharge summaries provide a wealth of information, but extracting useful structured information from such unstructured text is challenging. Supervised machine learning (ML) algorithms can achieve good performance in extracting relations between entities, yet they require large annotated datasets, and manual annotation is expensive and time-consuming because it demands domain experts. Active learning (AL), a sample selection approach integrated with supervised ML, aims to minimize annotation cost while maximizing the performance of ML-based models: the classifier is trained on a limited number of carefully chosen samples, which saves both annotation time and cost. AL is especially well suited to settings where annotation is costly and a reasonable classifier must be trained from the limited labeled data available. The key factor in an active learning model's success is its selection of the samples that need annotation: the more informative the selected samples, the less time it takes to train the supervised model to high accuracy. The query strategy used for sample selection therefore plays a vital role in the AL process. In this study, we aim to develop a novel query strategy, designed using deep reinforcement learning techniques such as actor-critic, that selects the most informative samples from the dataset and thereby accelerates the supervised model's performance. The effectiveness of the sample selection strategy is determined by measuring the model's accuracy after a predefined number of iterations.
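To make the pool-based AL loop described above concrete, the following is a minimal sketch of the generic process (seed set, query strategy, oracle annotation, retraining). It uses plain least-confidence sampling as the query strategy and synthetic data; the function names, the data, and the classifier are illustrative assumptions and do not reproduce the paper's actor-critic query strategy.

```python
# Minimal pool-based active learning loop with least-confidence sampling.
# The paper's query strategy is learned with actor-critic RL; here a simple
# uncertainty heuristic stands in so the loop itself is clear.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def least_confidence_query(model, X_pool, k):
    """Return indices of the k pool samples the model is least confident on."""
    probs = model.predict_proba(X_pool)
    confidence = probs.max(axis=1)        # probability of the top class
    return np.argsort(confidence)[:k]     # lowest confidence first


def active_learning_loop(X, y, seed_size=10, batch=5, iterations=8):
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X), size=seed_size, replace=False))
    pool = [i for i in range(len(X)) if i not in set(labeled)]
    model = LogisticRegression(max_iter=1000)
    for _ in range(iterations):
        model.fit(X[labeled], y[labeled])
        picks = least_confidence_query(model, X[pool], batch)
        # "Annotate" the selected samples; ground-truth labels play the oracle.
        for p in sorted(picks, reverse=True):
            labeled.append(pool.pop(p))
    model.fit(X[labeled], y[labeled])
    return model, len(labeled)


if __name__ == "__main__":
    X, y = make_classification(n_samples=500, n_features=20, random_state=42)
    model, n_labeled = active_learning_loop(X, y)
    print(n_labeled, round(model.score(X, y), 3))
```

An RL-based strategy would replace `least_confidence_query` with a learned policy whose reward is the held-out accuracy gain after each annotation batch.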
