Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

Background: The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. To optimise its performance, we customised several aspects of our solution: the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.

Results: Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. When its performance is compared with that of state-of-the-art methods under comparable experimental settings, our solution achieves a competitive advantage. We also show that our recogniser, using a model trained on the CHEMDNER corpus, is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.

Conclusion: The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology that has demonstrated performance at a level competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.

Sophia Ananiadou | Riza Theresa Batista-Navarro | Rafal Rak
