oreChem ChemXSeer: a semantic digital library for chemistry

Representing the semantics of unstructured scientific publications will certainly facilitate access and search and hopefully lead to new discoveries. However, current digital libraries are usually limited to classic, flat structured metadata, even for scientific publications that potentially contain rich semantic metadata. In addition, how to search scientific literature through linked semantic metadata is an open problem. We have developed a semantic digital library, oreChem ChemXSeer, that models chemistry papers with semantic metadata. It stores and indexes metadata extracted from a chemistry paper repository, ChemXSeer, using "compound objects". We use the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) standard (http://www.openarchives.org/ore/) to define a compound object that aggregates metadata fields related to a digital object. Aggregated metadata can be managed and retrieved as one unit, which improves ease of use and has the potential to improve the semantic interpretation of shared data. We show how metadata can be extracted from documents and aggregated using OAI-ORE. ORE objects are created on demand; thus, we are able to search for a set of linked metadata with one query. We were also able to model new types of metadata easily. For example, chemists are especially interested in finding information related to experiments in documents. We show how paragraphs containing experiment information in chemistry papers can be extracted and tagged based on a chemistry ontology with 470 classes, and then represented in ORE along with other document-related metadata. Our algorithm uses a classifier whose features are words that are typically used only to describe experiments, such as "apparatus" and "prepare". Using a dataset composed of documents from the Royal Society of Chemistry digital library, we show that our proposed method performs well in extracting experiment-related paragraphs from chemistry documents.
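As a rough illustration of the pipeline described above, the following minimal Python sketch tags experiment-related paragraphs with cue-word features and aggregates them into an ORE-style resource map. It is not the authors' implementation. The abstract names only the cue words "apparatus" and "prepare"; the threshold rule is a simple stand-in for the trained classifier, and the function names, the `example.org` URI, the `#aggregation` and `/rem` URI suffixes, and the `ExperimentParagraph` label are all hypothetical. Only the ORE vocabulary terms (`ResourceMap`, `Aggregation`, `describes`, `aggregates`) come from the OAI-ORE specification.

```python
import re

# Namespace of the OAI-ORE vocabulary terms.
ORE = "http://www.openarchives.org/ore/terms/"

# Cue words named in the abstract; a real feature set would be larger.
CUE_WORDS = {"apparatus", "prepare"}


def cue_features(paragraph):
    """Count tokens starting with each cue word (so 'prepared' matches 'prepare')."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    return {cue: sum(tok.startswith(cue) for tok in tokens) for cue in CUE_WORDS}


def is_experiment(paragraph, threshold=1):
    """Stand-in for the classifier: flag a paragraph with enough cue-word hits."""
    return sum(cue_features(paragraph).values()) >= threshold


def build_resource_map(doc_uri, title, paragraphs):
    """Aggregate document metadata and tagged paragraphs into one compound object."""
    aggregated = [
        {"@id": f"{doc_uri}#para{i}", "type": "ExperimentParagraph", "text": p}
        for i, p in enumerate(paragraphs)
        if is_experiment(p)
    ]
    return {
        "@id": doc_uri + "/rem",
        "@type": ORE + "ResourceMap",
        ORE + "describes": {
            "@id": doc_uri + "#aggregation",
            "@type": ORE + "Aggregation",
            "title": title,
            ORE + "aggregates": aggregated,
        },
    }


if __name__ == "__main__":
    paragraphs = [
        "The apparatus was prepared under nitrogen.",
        "Related work on digital libraries is reviewed.",
    ]
    rem = build_resource_map("http://example.org/paper/1", "Sample paper", paragraphs)
    for item in rem[ORE + "describes"][ORE + "aggregates"]:
        print(item["@id"], item["text"])
```

Because the resource map is built by a function call rather than stored, it mirrors the on-demand creation of ORE objects mentioned in the abstract: one request returns the document metadata together with its linked experiment paragraphs.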
