Terminology-driven mining of biomedical literature

In this paper we present an overview of an integrated framework for terminology-driven mining of biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities, and term clustering. Term variant recognition is incorporated into the term recognition process by taking into account orthographic, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure in which terms are compared using both internal and external evidence. The measure combines lexical, syntactic and contextual similarity. We present experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts.
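The hybrid similarity measure can be illustrated with a minimal sketch. The abstract states only that lexical, syntactic and contextual similarity are combined; everything else below is an assumption made for illustration: the weighted linear combination, the equal default weights, the use of the Dice coefficient for each component, and the names `TermEvidence`, `patterns`, `contexts` and `hybrid_similarity`. Here the lexical component stands in for internal evidence (the words of the term itself), and the other two stand in for external evidence (where the term occurs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEvidence:
    """Hypothetical container for the evidence collected about one term."""
    term: str
    # Assumed representation: labels of syntactic patterns the term occurs in.
    patterns: frozenset = frozenset()
    # Assumed representation: labels of context patterns seen around the term.
    contexts: frozenset = frozenset()


def dice(a, b):
    """Dice coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def lexical_similarity(t1, t2):
    """Internal evidence: overlap between the words that make up the terms."""
    return dice(set(t1.term.lower().split()), set(t2.term.lower().split()))


def syntactic_similarity(t1, t2):
    """External evidence: overlap between shared syntactic patterns."""
    return dice(t1.patterns, t2.patterns)


def contextual_similarity(t1, t2):
    """External evidence: overlap between shared context patterns."""
    return dice(t1.contexts, t2.contexts)


def hybrid_similarity(t1, t2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted linear combination of the three components (an assumption)."""
    w_lex, w_syn, w_ctx = weights
    return (w_lex * lexical_similarity(t1, t2)
            + w_syn * syntactic_similarity(t1, t2)
            + w_ctx * contextual_similarity(t1, t2))


# Toy evidence; the pattern labels are invented for the example.
a = TermEvidence("nuclear receptor",
                 patterns=frozenset({"such_as"}),
                 contexts=frozenset({"V:bind NP", "V:activate NP"}))
b = TermEvidence("orphan nuclear receptor",
                 patterns=frozenset({"such_as"}),
                 contexts=frozenset({"V:bind NP"}))
score = hybrid_similarity(a, b)
```

The resulting pairwise scores could then feed a standard clustering procedure; how the weights are chosen is left open in this sketch.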
