Combining Terminology Resources and Statistical Methods for Entity Recognition: an Evaluation

Terminologies and other knowledge resources are widely used to aid entity recognition in specialist domain texts. As well as providing lexicons of specialist terms, linkage from the text back to a resource can make additional knowledge available to applications. Use of such resources is especially pertinent in the biomedical domain, where large numbers of these resources are available and widely used in informatics applications. The most direct way to use a terminology resource is simple lexical lookup of its terms in the text. A major drawback of such lookup, however, is poor precision caused by ambiguity between domain terms and general language words. We combine lexical lookup with simple filtering of ambiguous terms to improve precision. We compare this filtered lexical lookup with a statistical method of entity recognition, and with a method that combines the two approaches. We show that the combined method boosts precision with little loss of recall, and that linkage from recognised entities back to the domain knowledge resources can be maintained.
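As an illustration of lexical lookup with filtering of ambiguous terms, the following is a minimal sketch and not the system evaluated here. The lexicon entries, the concept identifiers, the filter list, and the function name `lookup` are all hypothetical, chosen only to show how a filtered term is dropped while each remaining match keeps its link back to the resource.

```python
import re

# Hypothetical toy lexicon: lower-cased surface term -> concept identifier
# in some terminology resource (the identifiers are invented for illustration).
LEXICON = {
    "aspirin": "C0002",
    "chest pain": "C0003",
    "all": "C0004",  # a domain abbreviation that collides with a common word
}

# Hypothetical filter list: terms judged too ambiguous with general language.
AMBIGUOUS = {"all"}


def lookup(text, lexicon, filtered=frozenset()):
    """Return (start, end, surface, concept_id) for each lexicon term in text.

    Terms listed in `filtered` are excluded before matching. The concept
    identifier is kept with every match, so the link to the resource survives.
    """
    terms = [t for t in lexicon if t not in filtered]
    if not terms:
        return []
    # Try longer terms first so multi-word terms win over their substrings.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling `lookup(text, LEXICON)` on a sentence beginning "All patients ..." would tag the general-language word "All" as a domain term, whereas `lookup(text, LEXICON, AMBIGUOUS)` suppresses that match at the cost of missing any genuine use of the abbreviation. This is the precision and recall trade-off that the filtering step is meant to manage.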
