Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents

End users rely on chemical search engines to search for chemical formulae and chemical names. To support efficient search and retrieval, these engines identify and index the chemical formulae and chemical names that appear in text documents. Automatically identifying chemical formulae and chemical names in text is a hard problem, and previous approaches have met with varying degrees of success. We propose algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing published methods. Once chemical entities have been identified in text documents, they must be indexed. To support user queries that require a partial match, where the keyword is a segment of a chemical name or a partial chemical formula, the index must contain all possible (or a significant number of) subformulae of the formulae that appear in any document and all possible subterms (e.g., “methyl”) of chemical names (e.g., “methylethyl ketone”). Indexing all possible subformulae and subterms results in an exponential increase in storage and memory requirements, as well as in the time taken to process the indices. We propose techniques that prune the indices substantially without significantly reducing the quality of the returned results. Finally, we propose multiple query semantics that allow users to pose different types of partial search queries for chemical entities. We demonstrate empirically that our search engines improve the relevance of the returned results for search queries involving chemical entities.
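To make the index-growth problem concrete, the following is a minimal sketch in Python. It assumes one simple reading of "subformula": any non-empty selection of between zero and n atoms of each element in the formula. The function names `parse_formula` and `subformulae` are hypothetical and used only for illustration; this is not the tagging or indexing algorithm proposed in the paper, and it handles only flat formulae without parentheses or charges.

```python
import re
from itertools import product


def parse_formula(formula):
    """Parse a flat formula such as 'C2H6O' into {'C': 2, 'H': 6, 'O': 1}.

    Assumption: no parentheses, charges, or isotopes.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def subformulae(formula):
    """Enumerate every non-empty partial formula under the assumed definition.

    Each element may appear 0..n times, where n is its count in the formula,
    so the result has prod(n_i + 1) - 1 entries.
    """
    counts = parse_formula(formula)
    elements = sorted(counts)
    result = set()
    for choice in product(*(range(counts[e] + 1) for e in elements)):
        if any(choice):  # skip the empty selection
            result.add(
                "".join(
                    f"{e}{n if n > 1 else ''}"
                    for e, n in zip(elements, choice)
                    if n
                )
            )
    return result


if __name__ == "__main__":
    for f in ["H2O", "C2H6O", "C6H12O6"]:
        print(f, len(subformulae(f)))
```

Under this definition the number of index entries per formula is the product of (n_i + 1) over all elements, minus one, so it multiplies with every additional element. That is the kind of growth that motivates pruning the index rather than storing every subformula.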
