Recognition of Chemical Entities using Pattern Matching and Functional Group Classification

The two main challenges in chemical entity recognition are: (i) new chemical compounds are constantly being synthesized, so the set of names is open-ended; and (ii) chemical representation is highly ambiguous, since a single chemical entity can be described under several nomenclatures. Identifying and maintaining chemical terminologies is therefore a difficult task. Because most existing text mining methods follow term-based approaches, they suffer from the problems of polysemy and synonymy. To address this, a Named Entity Recognition (NER) system based on pattern matching is developed for the chemical domain to extract chemical entities from chemical documents. The TF-IDF and pointwise mutual information (PMI) association measures are used to filter out non-chemical terms. An F-score of 92.19% is achieved for chemical NER. The proposed method is compared with a baseline method and with other existing approaches. As the final step, the filtered chemical entities are classified into sixteen functional groups. The classification uses a support vector machine (SVM) with the One-against-All multiclass approach and achieves an accuracy of 87%. One-way ANOVA is used to compare the pattern matching method against other existing chemical NER methods.
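To make the filtering step concrete, the following is a minimal sketch of how TF-IDF and document-level PMI could be combined to discard non-chemical candidate terms. It is not the authors' implementation: the toy corpus, the cue word `reacts`, the zero thresholds, and the helper names `tf_idf`, `pmi`, and `keep_candidates` are all assumptions made for illustration.

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in the tokenized `doc`, relative to the corpus `docs`."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)          # document frequency
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def pmi(term, cue, docs):
    """Document-level PMI: log2(P(term, cue) / (P(term) * P(cue)))."""
    n = len(docs)
    p_term = sum(1 for d in docs if term in d) / n
    p_cue = sum(1 for d in docs if cue in d) / n
    p_joint = sum(1 for d in docs if term in d and cue in d) / n
    if p_joint == 0:
        return float("-inf")                        # never co-occur
    return math.log2(p_joint / (p_term * p_cue))

def keep_candidates(doc, docs, cue, tfidf_min=0.0, pmi_min=0.0):
    """Keep terms whose TF-IDF and PMI with the cue both exceed the thresholds.

    The cue word and the thresholds are hypothetical choices for this sketch.
    """
    kept = []
    for term in dict.fromkeys(doc):                 # unique terms, original order
        if term == cue:
            continue
        if tf_idf(term, doc, docs) > tfidf_min and pmi(term, cue, docs) > pmi_min:
            kept.append(term)
    return kept

# Toy tokenized corpus (assumed data, for illustration only).
docs = [
    ["ethanol", "reacts", "with", "acid"],
    ["benzene", "reacts", "with", "chlorine"],
    ["the", "results", "with", "the", "table"],
    ["the", "study", "with", "results"],
]

print(keep_candidates(docs[0], docs, cue="reacts"))
```

In this toy setting, a word such as "with" appears in every document, so its IDF is zero and it is dropped, while terms that co-occur with the chemical cue more often than chance receive a positive PMI and are retained.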
