MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup

Dictionary-based biological concept extraction is still the state-of-the-art approach to large-scale biomedical literature annotation and indexing. Exact dictionary lookup is a very simple approach, but it typically achieves low extraction recall, because a biological term often has many variants and no dictionary can collect all of them. We propose a generic extraction approach, referred to as approximate dictionary lookup, to cope with term variation, and implement it as an extraction system called MaxMatcher. The basic idea of this approach is to capture the significant words of a particular concept rather than all of its words. The new approach dramatically improves extraction recall while maintaining precision. In a comparative study on the GENIA corpus, the new approach reaches 57% recall, whereas exact dictionary lookup achieves only 26%.
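The contrast between exact and approximate lookup can be illustrated with a minimal sketch. This is not MaxMatcher's actual algorithm: the toy dictionary, the function names, the significance weighting (one over the number of concepts that use a word), and the 0.8 threshold are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy dictionary (concept ID -> term); not taken from the paper.
DICTIONARY = {
    "C1": "interleukin 2 receptor alpha chain",
    "C2": "interleukin 2 receptor beta chain",
    "C3": "nuclear factor kappa b",
}


def tokenize(text):
    """Lowercase and split on whitespace, treating hyphens as spaces."""
    return text.lower().replace("-", " ").split()


def exact_lookup(phrase, dictionary=DICTIONARY):
    """Return concepts whose term equals the phrase word for word."""
    words = tokenize(phrase)
    return sorted(cid for cid, term in dictionary.items()
                  if tokenize(term) == words)


def approximate_lookup(phrase, dictionary=DICTIONARY, threshold=0.8):
    """Return concepts whose significant words are mostly present.

    Assumed scoring: a word's significance is 1 / (number of concepts
    using it). A concept matches when the matched share of its total
    significance reaches the threshold.
    """
    concept_words = {cid: set(tokenize(term))
                     for cid, term in dictionary.items()}
    usage = Counter(w for words in concept_words.values() for w in words)
    present = set(tokenize(phrase))
    matches = []
    for cid, words in concept_words.items():
        total = sum(1.0 / usage[w] for w in words)
        matched = sum(1.0 / usage[w] for w in words & present)
        if matched / total >= threshold:
            matches.append(cid)
    return sorted(matches)


if __name__ == "__main__":
    variant = "interleukin-2 receptor alpha"  # a variant missing "chain"
    print("exact:", exact_lookup(variant))
    print("approximate:", approximate_lookup(variant))
```

In this sketch the variant "interleukin-2 receptor alpha" is missed by exact lookup because the word "chain" is absent, while the approximate version still selects C1 because "alpha", the word that distinguishes C1 from C2, is present. A phrase such as "interleukin-2 receptor" matches neither concept, since the words it contains are shared by both.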
