Cascaded classifiers for confidence-based chemical named entity recognition

Background: Chemical named entities represent an important facet of biomedical text.

Results: We have developed a system that uses character-based n-grams, Maximum Entropy Markov Models and rescoring to recognise chemical names and other such entities, and to make confidence estimates for the extracted entities. An adjustable threshold allows the system to be tuned to high precision or high recall. At a threshold set for balanced precision and recall, we were able to extract named entities at an F score of 80.7% from chemistry papers and 83.2% from PubMed abstracts. Furthermore, we were able to achieve 57.6% and 60.3% recall at 95% precision, and 58.9% and 49.1% precision at 90% recall.

Conclusion: These results show that chemical named entities can be extracted with good performance, and that the properties of the extraction can be tuned to suit the demands of the task.
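The effect of an adjustable confidence threshold on precision and recall can be illustrated with a minimal Python sketch. This is not the system described above: the entity strings, confidence values, gold set and function names are all hypothetical, chosen only to show how raising the threshold favours precision and lowering it favours recall.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical toy data: (entity text, confidence) pairs from some recogniser,
# and a hypothetical gold-standard set of true chemical entities.
SCORED: List[Tuple[str, float]] = [
    ("ethanol", 0.95),
    ("benzene", 0.90),
    ("water", 0.60),
    ("reaction", 0.40),  # a false positive in this toy example
    ("NaCl", 0.30),
]
GOLD: Set[str] = {"ethanol", "benzene", "water", "NaCl"}


def filter_by_confidence(scored: Iterable[Tuple[str, float]],
                         threshold: float) -> Set[str]:
    """Keep only entities whose confidence is at or above the threshold."""
    return {entity for entity, conf in scored if conf >= threshold}


def precision_recall_f(predicted: Set[str],
                       gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, balanced F score) for a predicted set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Sweep the threshold from strict to lenient.
    for threshold in (0.9, 0.5, 0.3):
        kept = filter_by_confidence(SCORED, threshold)
        p, r, f = precision_recall_f(kept, GOLD)
        print(f"threshold={threshold:.1f} precision={p:.2f} "
              f"recall={r:.2f} F={f:.2f}")
```

In this toy data, a strict threshold keeps only the most confident entities, so precision is high and recall is low, while a lenient threshold admits the false positive in exchange for recovering every gold entity.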
