An enhanced CRFs-based system for information extraction from radiology reports

We address the problem of performing information extraction (IE) from free-text radiology reports via supervised learning. In this task, segments of text (not necessarily coinciding with entire sentences, and possibly crossing sentence boundaries) need to be annotated with tags representing concepts of interest in the radiological domain. We present two novel approaches to IE for radiology reports: (i) a cascaded, two-stage method that pipelines two taggers generated via the well-known linear-chain conditional random fields (LC-CRFs) learner, and (ii) a confidence-weighted ensemble method that combines standard LC-CRFs and the proposed two-stage method. We also report on the use of "positional features", a novel type of feature intended to aid the automatic annotation of texts in which instances of a given concept may be hypothesized to occur systematically in specific areas of the text. We present experiments on a dataset of mammography reports in which the proposed ensemble outperforms a traditional, single-stage CRFs system in two different scenarios of practical interest.
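As a rough illustration of two of the ideas named above, the following minimal Python sketch shows one possible form of a positional feature and one possible way to combine two taggers by confidence. The abstract does not specify how positions are encoded or how confidences are computed and combined, so everything here is an assumption made for illustration: the function names, the binning of relative token position, the "keep the more confident tag" rule, and the example tags are all hypothetical and should not be read as the authors' actual method.

```python
from typing import List, Tuple

# A per-token prediction: (tag, confidence in [0, 1]).
# Both the representation and the confidence scale are assumptions.
Prediction = Tuple[str, float]


def positional_feature(token_index: int, num_tokens: int, num_bins: int = 5) -> str:
    """Map a token's relative position in a report to a coarse bin feature.

    Hypothetical encoding: the report is split into `num_bins` equal-width
    areas and the feature records which area the token falls in.
    """
    if num_tokens <= 0 or not 0 <= token_index < num_tokens:
        raise ValueError("token_index must lie within the report")
    relative = token_index / num_tokens          # value in [0, 1)
    bin_id = min(int(relative * num_bins), num_bins - 1)
    return f"pos_bin={bin_id}"


def combine_by_confidence(
    single_stage: List[Prediction],
    two_stage: List[Prediction],
) -> List[str]:
    """Per token, keep the tag of whichever tagger is more confident.

    This is one simple reading of a confidence-weighted ensemble;
    ties go to the single-stage tagger.
    """
    if len(single_stage) != len(two_stage):
        raise ValueError("both taggers must label the same token sequence")
    combined = []
    for (tag_a, conf_a), (tag_b, conf_b) in zip(single_stage, two_stage):
        combined.append(tag_a if conf_a >= conf_b else tag_b)
    return combined


if __name__ == "__main__":
    # Hypothetical tags and confidences for a two-token sequence.
    baseline = [("O", 0.9), ("Finding", 0.4)]
    cascaded = [("Finding", 0.6), ("Location", 0.7)]
    print(positional_feature(5, 10))
    print(combine_by_confidence(baseline, cascaded))
```

In a real system the positional feature would be added to each token's feature vector before CRF training, and the confidences would typically come from the taggers' own probability estimates; neither detail is given in the abstract.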
