Catching the Drift - Indexing Implicit Knowledge in Chemical Digital Libraries

In chemistry, information gathering focuses heavily on chemical entities. Synonyms and differing entity representations, however, make the indexing of chemical documents challenging. In drug design the task is even more complex: domain experts are usually not interested in a single chemical entity but in representatives of a chemical class that show a specific reaction behavior. The parts of an entity most relevant for describing this reaction behavior are its functional groups. The boundaries of a chemical class are related to the entities' reaction behavior, but they also rest on the chemist's implicit knowledge. In this paper we present an approach that addresses this implicit knowledge by clustering chemical entities based on their functional groups. Since such clusters are generally too unspecific and contain chemical entities from different chemical classes, we further divide them into sub-clusters using fingerprint-based similarity measures. We analyze several uncorrelated fingerprint/similarity measure combinations and show that the entities most similar to a query entity can be found in the respective sub-cluster. Furthermore, we use our approach for document retrieval, introducing a new similarity measure based on Wikipedia categories. Our evaluation shows that the sub-clustering leads to suitable results, enabling sophisticated document retrieval in chemical digital libraries.
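The two-stage idea described in the abstract (cluster by functional groups, then split each cluster by fingerprint similarity) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the entity names, functional-group sets, and fingerprint bits are toy data, the Tanimoto coefficient stands in for whichever similarity measures the paper evaluates, and the greedy leader-style sub-clustering with a fixed threshold is a simple substitute, not the method used by the authors.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def cluster_by_functional_groups(entities):
    """Stage 1: group entity names by their exact set of functional groups."""
    clusters = defaultdict(list)
    for name, (groups, _fingerprint) in entities.items():
        clusters[frozenset(groups)].append(name)
    return dict(clusters)


def sub_cluster(names, fingerprints, threshold=0.5):
    """Stage 2 (assumed strategy): greedy leader clustering.

    An entity joins the first sub-cluster whose leader (first member) is at
    least `threshold` similar to it; otherwise it starts a new sub-cluster.
    """
    subs = []
    for name in names:
        for sub in subs:
            if tanimoto(fingerprints[name], fingerprints[sub[0]]) >= threshold:
                sub.append(name)
                break
        else:
            subs.append([name])
    return subs


# Hypothetical toy data: name -> (functional groups, fingerprint bit positions).
ENTITIES = {
    "mol_a": ({"hydroxyl", "carboxyl"}, {1, 2, 3, 4}),
    "mol_b": ({"hydroxyl", "carboxyl"}, {1, 2, 3, 5}),
    "mol_c": ({"hydroxyl", "carboxyl"}, {7, 8, 9}),
    "mol_d": ({"amine"}, {1, 2}),
}

fingerprints = {name: fp for name, (_groups, fp) in ENTITIES.items()}
clusters = cluster_by_functional_groups(ENTITIES)
index = {
    groups: sub_cluster(names, fingerprints, threshold=0.5)
    for groups, names in clusters.items()
}

for groups, subs in index.items():
    print(sorted(groups), subs)
```

In this toy data, the three entities sharing the same functional groups fall into one cluster, and the fingerprint comparison then separates the structurally dissimilar one. A query entity would be looked up by its functional-group set first and compared only against the sub-cluster leaders, which is one plausible way to exploit such an index.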
