Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon

OBJECTIVE The aim of this study was to develop and evaluate a method of extracting noun phrases with full phrase structures from a set of clinical radiology reports using natural language processing (NLP), and to investigate whether the UMLS® Specialist Lexicon improves noun phrase identification within clinical radiology documents.

DESIGN The noun phrase identification (NPI) module is composed of a sentence boundary detector, a statistical natural language parser trained on a nonmedical domain, and a noun phrase (NP) tagger. The NPI module processed a set of 100 XML-represented clinical radiology reports in Health Level 7 (HL7®) Clinical Document Architecture (CDA)-compatible format. Computed output was compared with manual markup: four physicians and one author marked maximal (longest) NPs, and one author marked base (simple) NPs. An extended lexicon of biomedical terms was created from the UMLS Specialist Lexicon and used to improve NPI performance.

RESULTS The test set comprised 50 randomly selected reports. The sentence boundary detector achieved 99.0% precision and 98.6% recall. The overall maximal NPI precision and recall were 78.9% and 81.5% before using the UMLS Specialist Lexicon and 82.1% and 84.6% after. The overall base NPI precision and recall were 88.2% and 86.8% before using the UMLS Specialist Lexicon and 93.1% and 92.6% after, reducing false positives by 31.1% and false negatives by 34.3%.

CONCLUSION The sentence boundary detector performs excellently. After adaptation using the UMLS Specialist Lexicon, the statistical parser's NPI performance on radiology reports increased to levels comparable to the parser's native performance in its newswire training domain and to that reported by other researchers in the general nonmedical domain.
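To make the distinction between maximal and base NPs concrete, the following is a minimal sketch, not the study's implementation. It assumes the parser emits Penn Treebank style bracketed trees, treats a maximal NP as an NP node with no NP ancestor and a base NP as an NP node with no NP descendant, and scores by exact span match. The example sentence and the function names `parse_tree`, `np_spans`, and `precision_recall` are hypothetical, and the abstract does not state the matching criterion actually used.

```python
from typing import List, Set, Tuple, Union

Span = Tuple[int, int]                      # (start, end) token offsets, end exclusive
Node = Union[str, Tuple[str, list]]         # leaf word, or (label, children)


def parse_tree(bracketed: str) -> Node:
    """Parse a Penn Treebank style bracketed string into nested (label, children) tuples."""
    tokens: List[str] = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def np_spans(tree: Node) -> Tuple[Set[Span], Set[Span]]:
    """Return (maximal, base) NP spans under the assumed definitions above."""
    maximal: Set[Span] = set()
    base: Set[Span] = set()

    def walk(node: Node, start: int, inside_np: bool) -> Tuple[int, bool]:
        # Returns (end offset, whether this subtree contains an NP node).
        if isinstance(node, str):
            return start + 1, False
        label, children = node
        is_np = label == "NP"
        end, has_np_below = start, False
        for child in children:
            end, child_has_np = walk(child, end, inside_np or is_np)
            has_np_below = has_np_below or child_has_np
        if is_np and not inside_np:
            maximal.add((start, end))
        if is_np and not has_np_below:
            base.add((start, end))
        return end, is_np or has_np_below

    walk(tree, 0, False)
    return maximal, base


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over span sets (assumed scoring rule)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical parser output for an invented radiology-style sentence.
EXAMPLE = ("(S (NP (NP (DT a) (JJ small) (NN nodule)) "
           "(PP (IN in) (NP (DT the) (JJ right) (NN lung)))) "
           "(VP (VBZ is) (VP (VBN seen))))")

maximal, base = np_spans(parse_tree(EXAMPLE))
print("maximal:", sorted(maximal))
print("base:", sorted(base))
# Hypothetical gold markup that contains only one of the two base NPs found.
print(precision_recall(base, {(0, 3)}))
```

Under these assumptions the whole subject phrase counts as one maximal NP, while "a small nodule" and "the right lung" are separate base NPs. This is why the two evaluations in the abstract are reported separately.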
