Exploiting Unlabeled Texts with Clustering-based Instance Selection for Medical Relation Classification

Classifying relations between pairs of medical concepts in clinical texts is a crucial task for acquiring empirical evidence relevant to patient care. Because labeled data is limited and class distributions are extremely unbalanced, medical relation classification systems struggle to achieve good performance on the less common relation types, even though these types capture valuable information. Our research aims to improve relation classification using weakly supervised learning. We present two clustering-based instance selection methods that acquire a diverse and balanced set of additional training instances from unlabeled data. The first method selects one representative instance from each cluster containing only unlabeled data. The second method selects a counterpart for each training instance using clusters containing both labeled and unlabeled data. Compared to supervised learning, these new instance selection methods for weakly supervised learning achieve substantial recall gains for the minority relation classes while yielding comparable performance on the majority relation classes.
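The two selection strategies can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the distance measure, or the exact rule for choosing a representative or a counterpart, so everything below is an assumption made for illustration: cluster assignments are taken as given (computed beforehand by any clustering tool), distance is Euclidean, the representative is the instance nearest its cluster centroid, and the counterpart is the nearest not-yet-selected unlabeled instance in the same cluster. The function names `select_representatives` and `select_counterparts` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_representatives(unlabeled, cluster_ids):
    """First strategy (sketch): pick one unlabeled instance per cluster.

    Assumption: the representative is the member closest to the centroid.
    Returns indices into `unlabeled`, ordered by cluster id.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    chosen = []
    for cid in sorted(clusters):
        members = clusters[cid]
        center = centroid([unlabeled[i] for i in members])
        chosen.append(min(members, key=lambda i: dist(unlabeled[i], center)))
    return chosen


def select_counterparts(labeled, unlabeled, labeled_cids, unlabeled_cids):
    """Second strategy (sketch): pick one unlabeled counterpart per labeled instance.

    Assumption: labeled and unlabeled instances were clustered together, and
    the counterpart is the nearest unused unlabeled instance in the same cluster.
    Returns a dict mapping labeled index -> unlabeled index.
    """
    by_cluster = defaultdict(list)
    for j, cid in enumerate(unlabeled_cids):
        by_cluster[cid].append(j)

    used = set()
    pairs = {}
    for i, cid in enumerate(labeled_cids):
        candidates = [j for j in by_cluster.get(cid, []) if j not in used]
        if not candidates:
            continue  # no unlabeled instance left in this cluster
        best = min(candidates, key=lambda j: dist(labeled[i], unlabeled[j]))
        used.add(best)
        pairs[i] = best
    return pairs
```

Under these assumptions, the second strategy adds at most one new instance per labeled instance, so the class balance of the additions roughly follows the spread of the labeled data across clusters; how the selected instances are then labeled and used for training is not covered by this sketch.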
