Exposing the hidden web for chemical digital libraries

In recent years, the vast amount of digitally available content has led to the creation of many topic-centered digital libraries. In chemistry, too, more and more digital collections are becoming available, but complex query formulation still hampers their intuitive adoption. The reason is that information seeking in chemical documents focuses on chemical entities, and current standard search relies on complex structure representations that are hard to extract from documents. Moreover, although simple keyword searches would often be sufficient, current collections cannot be indexed by Web search providers because chemical substance names are ambiguous. In this paper we present a framework for automatically generating metadata-enriched index pages for all documents in a given chemical collection. All information on these pages is linked to the respective documents, yielding an easy-to-crawl metadata repository that promises to open up digital chemical libraries. Our experiments, indexing an open access journal, show not only that the documents can be found with a simple Google search via the automatically created index pages, but also that this approach outperforms full-text indexing in terms of both retrieval quality (precision/recall) and performance. Finally, we compare our indexing against a classical structure search and find that keyword-based search can indeed solve at least some of the daily tasks in chemical workflows. Our framework thus promises to expose a large part of the currently still hidden chemical Web, making the techniques employed interesting for chemical information providers such as digital libraries and open access journals.
