Chemical name recognition with harmonized feature-rich conditional random fields

This article presents a machine learning-based solution for automatic chemical and drug name recognition in scientific documents, which was applied in the BioCreative IV CHEMDNER task, namely in the chemical entity mention recognition (CEM) and chemical document indexing (CDI) sub-tasks. The proposed approach applies conditional random fields (CRFs) with a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context (i.e., conjunctions) features. Post-processing modules are also integrated, performing parentheses correction and abbreviation resolution. Finally, heterogeneous CRF models are harmonized to generate improved annotations. The performance achieved on the development set is encouraging, with F-scores of 83.71% on CEM and 82.05% on CDI.
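
To make the orthographic, morphological and local context feature types more concrete, the following is a minimal Python sketch of per-token feature extraction for a CRF tagger. It is an illustration under stated assumptions, not the feature set used in this work: the function names (`token_features`, `sentence_features`), the specific features, the affix length of 3 and the window size of 1 are all hypothetical choices, and local context is approximated here by copying the features of neighbouring tokens under an offset prefix.

```python
import re


def token_features(token):
    """Orthographic and morphological features for one token.

    The feature choices below are illustrative assumptions, not the
    feature set described in the article.
    """
    # Word shape: uppercase -> "A", lowercase -> "a", digit -> "0";
    # any other character (e.g. a hyphen) is kept unchanged.
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return {
        "lower": token.lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
        "prefix3": token[:3].lower(),   # morphological: leading characters
        "suffix3": token[-3:].lower(),  # morphological: trailing characters
        "shape": shape,
    }


def sentence_features(tokens, window=1):
    """Add local context by copying neighbours' features with an offset prefix.

    Returns one feature dictionary per token. Positions outside the
    sentence are marked with a padding feature instead.
    """
    base = [token_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        feats = dict(base[i])
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in base[j].items():
                    feats[f"{offset:+d}:{name}"] = value
            else:
                feats[f"{offset:+d}:pad"] = True
        rows.append(feats)
    return rows
```

Calling `sentence_features(["2-Acetoxybenzoic", "acid", "inhibits", "COX-1"])` yields one dictionary per token. A list of such dictionaries per sentence is a common input format for CRF toolkits, although the toolkit and feature encoding actually used by the authors are not stated in this abstract.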