Chemical named entities recognition: a review on approaches and applications

The rapid growth in the volume of published digital information across all disciplines has created a pressing need for techniques that simplify its use. The chemistry literature is rich in information about chemical entities. Extracting molecules and their associated properties and activities from the scientific literature, and then “text mining” the extracted data to determine contextual relationships, helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is recognizing the chemical entities mentioned in text. In this review, we briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We outline dictionary-based, rule-based, machine learning-based, and hybrid approaches to chemical named entity recognition, together with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
