Chemical entity extraction using CRF and an ensemble of extractors

Background
As interest in identifying and extracting chemical entities from academic articles grows, many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without requiring each individual extractor to generate a probability score. We quantitatively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. We then built a single Conditional Random Fields (CRF) extractor that uses the best-performing tokenizer and a unique collection of features, such as word embeddings and Soundex codes, which, to the best of our knowledge, has not been explored in this context before.

Results
The ensemble of multiple extractors outperformed each individual extractor in the CHEMDNER challenge. When the runs were optimized to favor recall, the ensemble approach achieved the second-highest recall on unseen entities. The single CRF model with novel features achieves an F1 score of 83.3% on the test set, without any post-processing or abbreviation matching.

Conclusions
Ensemble information extraction is effective when multiple standalone extractors are to be used, and it produces higher performance than individual off-the-shelf extractors. The novel features introduced in the single CRF model are sufficient to achieve a very competitive F1 score using a simple standalone extractor.
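The abstract names Soundex codes as one of the CRF features but does not say how they are computed. As a rough illustration only, the sketch below assumes standard American Soundex and a hypothetical `token_features` helper that packs the code into a per-token feature dictionary of the kind CRF toolkits commonly accept. The feature names and the helper are assumptions, not the authors' implementation, and word embeddings are left out.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(token):
    """Return the 4-character American Soundex code, or "" if no ASCII letters."""
    letters = [c for c in token.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return ("".join(out) + "000")[:4]


def token_features(token):
    """Hypothetical feature dictionary for one token (names are assumptions)."""
    return {
        "lower": token.lower(),
        "soundex": soundex(token),
        "has_digit": any(c.isdigit() for c in token),
    }
```

Under this scheme, similarly spelled names such as "ethanol" and "ethanal" receive the same code, which is one plausible reason a phonetic code could help a sequence tagger generalize across spelling variants.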
