Massive Biomedical Term Discovery

Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. The distinction between specific terminology and common, non-specific noun phrases can be based on the observation that terms exhibit far less distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and test the corresponding algorithm on bigram, trigram, and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-curated biomedical terminology as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system.
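
To make the idea of limited paradigmatic modifiability concrete, the following is a minimal sketch, not the algorithm evaluated in this work. It assumes a simplified scoring scheme: for each token slot of a candidate n-gram, the candidate's frequency is divided by the total frequency of all same-length n-grams that differ from it only in that slot, and the per-slot ratios are multiplied. The function name `modifiability_score` and the toy frequency counts are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def modifiability_score(ngram, counts):
    """Score an n-gram by how rarely its slots are filled by other tokens.

    Simplified illustration (an assumption, not the published measure):
    a score near 1.0 means the n-gram is seldom modified (term-like),
    a score near 0.0 means its slots vary freely (non-specific phrase).
    """
    target = counts[ngram]
    if target == 0:
        return 0.0
    score = 1.0
    for slot in range(len(ngram)):
        # Total frequency of same-length n-grams that agree with `ngram`
        # at every position except `slot` (includes `ngram` itself).
        total = sum(
            freq
            for other, freq in counts.items()
            if len(other) == len(ngram)
            and all(a == b for i, (a, b) in enumerate(zip(other, ngram)) if i != slot)
        )
        score *= target / total
    return score


# Hypothetical toy trigram frequencies, invented for this example.
counts = Counter({
    ("open", "reading", "frame"): 8,
    ("open", "reading", "frames"): 2,
    ("long", "term", "effect"): 2,
    ("short", "term", "effect"): 3,
    ("long", "term", "use"): 5,
})

term_like = modifiability_score(("open", "reading", "frame"), counts)
non_specific = modifiability_score(("long", "term", "effect"), counts)
print(term_like, non_specific)
```

Under these toy counts, the candidate whose slots are rarely filled by other tokens receives the higher score, which mirrors the observation that terms show less distributional variation than non-specific noun phrases. A realistic system would compute the counts from a large corpus and might also consider varying several slots at once.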
