Unsupervised, corpus-based method for extending a biomedical terminology

Objectives: To automatically extend an existing biomedical terminology downwards, using a corpus together with lexical and terminological knowledge. Methods: Adjectival modifiers are removed from terms extracted from the corpus (three million noun phrases extracted from MEDLINE), and the resulting demodified terms are searched for in the terminology (UMLS Metathesaurus, restricted to disorders and procedures). A phrase from MEDLINE becomes a candidate term for the Metathesaurus if two requirements are met: 1) a demodified term created from this phrase is found in the terminology, and 2) the modifiers removed to create the demodified term also modify existing terms from the terminology within the same semantic category. A sample of candidate terms was reviewed manually. Results: Out of the three million simple phrases randomly extracted from MEDLINE, 125,000 new terms were identified for inclusion in the UMLS. Of the 1,000 terms reviewed manually, 83% were associated with a relevant UMLS concept. Discussion: The limitations of this approach are discussed, as well as adaptation and generalization issues.
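
The two requirements in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy terminology, the adjective list, and the function names (`known_modifiers`, `demodify`, `find_candidate`) are hypothetical, and the sketch assumes that modifiers are single leading adjectives, whereas the actual method relies on noun phrases extracted and tagged from MEDLINE.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy terminology: term -> semantic category.
# These entries are illustrative stand-ins, not UMLS data.
TERMINOLOGY: Dict[str, str] = {
    "pneumonia": "disorder",
    "bacterial pneumonia": "disorder",
    "hepatitis": "disorder",
    "chronic hepatitis": "disorder",
    "biopsy": "procedure",
    "percutaneous biopsy": "procedure",
}

# Assumption: a fixed adjective list replaces a real part-of-speech tagger.
ADJECTIVES: Set[str] = {"bacterial", "chronic", "percutaneous", "severe"}


def known_modifiers(terminology: Dict[str, str]) -> Dict[str, Set[str]]:
    """Collect, per category, adjectives that already modify existing terms."""
    mods: Dict[str, Set[str]] = {}
    for term, category in terminology.items():
        words = term.split()
        if len(words) > 1 and words[0] in ADJECTIVES:
            head = " ".join(words[1:])
            # Count the adjective only if the demodified term exists
            # in the same semantic category.
            if terminology.get(head) == category:
                mods.setdefault(category, set()).add(words[0])
    return mods


def demodify(phrase: str) -> List[Tuple[List[str], str]]:
    """Strip leading adjectives one at a time; return (removed, remainder) pairs."""
    words = phrase.lower().split()
    removed: List[str] = []
    results: List[Tuple[List[str], str]] = []
    while len(words) > 1 and words[0] in ADJECTIVES:
        removed.append(words[0])
        words = words[1:]
        results.append((list(removed), " ".join(words)))
    return results


def find_candidate(
    phrase: str,
    terminology: Dict[str, str],
    mods: Dict[str, Set[str]],
) -> Optional[Tuple[str, str]]:
    """Return (parent term, category) if the phrase qualifies as a candidate."""
    if phrase.lower() in terminology:
        return None  # already in the terminology, nothing to add
    for removed, head in demodify(phrase):
        category = terminology.get(head)
        if category is None:
            continue  # requirement 1 fails: demodified term not found
        # Requirement 2: every removed modifier already modifies
        # some existing term of this semantic category.
        if all(m in mods.get(category, set()) for m in removed):
            return head, category
    return None


MODS = known_modifiers(TERMINOLOGY)
print(find_candidate("chronic pneumonia", TERMINOLOGY, MODS))
print(find_candidate("chronic biopsy", TERMINOLOGY, MODS))
```

In this toy setting, "chronic pneumonia" qualifies because "pneumonia" is a known disorder and "chronic" already modifies another disorder ("chronic hepatitis"), while "chronic biopsy" is rejected because "chronic" does not modify any existing procedure term.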
