Two-stage genetic approach for bio-chemical named entity recognition

Identifying mentions of chemical names in text has wide-ranging real-world applications. Chemical names are complex, and they appear in several representations and nomenclatures (such as SMILES, InChI, and IUPAC), which makes their automatic identification and classification challenging. In this paper we present an approach for selecting an appropriate feature subset for a well-known supervised machine learning model, the conditional random field (CRF) classifier. Several features are identified and extracted, without using any domain-specific knowledge or resources, for determining mentions of IUPAC and IUPAC-like names in scientific text with a supervised classifier. From this large collection of features, the subset appropriate for a particular supervised classifier is selected using a single-objective genetic algorithm based feature selection technique. Experiments are carried out on the benchmark patent data set. Evaluation shows encouraging performance, with an overall F-measure of 70.01% for the single-objective optimization based approach on the patent 2008 test data set.
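
As an illustration of the general idea only, single-objective genetic feature selection can be sketched as a search over binary masks, where each bit says whether one feature is kept. The sketch below is not the implementation used in this work: the function names (`ga_select`, `toy_fitness`), the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions chosen for brevity. In a real setting the fitness function would presumably train the CRF on the selected features and return its F-measure on held-out data; here a toy fitness stands in for that step.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def ga_select(n_features: int,
              fitness: Callable[[Mask], float],
              pop_size: int = 20,
              generations: int = 30,
              p_cross: float = 0.9,
              p_mut: float = 0.05,
              seed: int = 0) -> Tuple[Mask, float]:
    """Maximise a single objective `fitness` over binary feature masks."""
    rng = random.Random(seed)
    # Random initial population of feature masks.
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    best_fit = fitness(best)

    for _ in range(generations):
        scored = [(fitness(m), m) for m in pop]

        def pick() -> Mask:
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        nxt = [best[:]]  # elitism: carry the best mask found so far
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if n_features > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt

        cand = max(pop, key=fitness)
        cand_fit = fitness(cand)
        if cand_fit > best_fit:
            best, best_fit = cand[:], cand_fit
    return best, best_fit


def toy_fitness(mask: Mask) -> float:
    # Hypothetical stand-in for "F-measure of a CRF trained on these features":
    # features 0, 2 and 5 help, and every selected feature costs a little.
    useful = (0, 2, 5)
    return sum(mask[i] for i in useful) - 0.1 * sum(mask)


best_mask, best_score = ga_select(n_features=8, fitness=toy_fitness)
```

Because the best mask is copied into every new generation, the best score never decreases from one generation to the next. Replacing `toy_fitness` with a function that trains and evaluates a classifier leaves the search loop unchanged.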