NOBLE – Flexible concept recognition for large-scale biomedical natural language processing

Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus.

Methods: NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system’s matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator.

Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE’s performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems.

Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.
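To make the idea of term-to-concept matching concrete, the following is a minimal sketch of a generic greedy longest-match dictionary lookup. It is not NOBLE Coder's actual algorithm or API, which the abstract does not specify. The vocabulary, the concept identifiers, and the names `tokenize` and `match_concepts` are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary: normalized term -> made-up concept identifier.
# Synonyms map to the same concept, as in a typical terminology.
VOCAB: Dict[str, str] = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "heart": "C_HEART",
    "aspirin": "C_ASPIRIN",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_concepts(
    text: str, vocab: Dict[str, str], max_len: int = 5
) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup (an illustrative assumption, not NOBLE's code).

    At each token position, try the longest candidate phrase first and
    shrink it until a vocabulary term matches, then skip past the match.
    """
    tokens = tokenize(text)
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocab:
                matches.append((candidate, vocab[candidate]))
                i += n  # consume the matched tokens
                break
        else:
            i += 1  # no term starts at this token; move on
    return matches


if __name__ == "__main__":
    print(match_concepts("Patient had a heart attack and took aspirin.", VOCAB))
```

In this sketch the longest-match rule is why "heart attack" is reported as a single concept rather than as the shorter term "heart". A configurable system could expose choices such as the maximum phrase length or whether overlapping shorter matches are also returned.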
