LinkHub: a Semantic Web System for Efficiently Handling Complex Graphs of Proteomics Identifier Relationships that Facilitates Cross-database Queries and Information Retrieval

Background: A key abstraction in representing proteomics knowledge is the notion of unique identifiers for individual entities (e.g., proteins) and the massive graph of relationships among them. These relationships are sometimes simple (e.g., synonyms) but are often more complex (e.g., one-to-many relationships in protein family membership).

Results: We have built a software system called LinkHub, based on semantic-web RDF, that manages the graph of identifier relationships and allows it to be explored through a variety of interfaces. For efficiency, we also provide relational-database access and translation between the relational and RDF versions. LinkHub is practically useful for creating small, local hubs on common topics and then connecting these to major portals in a federated architecture; we have used LinkHub to establish such a relationship between UniProt and the North East Structural Genomics Consortium. LinkHub also facilitates queries and access to information and documents related to identifiers spread across multiple databases, acting as "connecting glue" between different identifier spaces. We demonstrate this with example queries that discover “interologs” of yeast protein interactions in the worm and explore the relationship between gene essentiality and pseudogene content, and we also show how “protein family based” retrieval of documents can be achieved. LinkHub is available at hub.gersteinlab.org and hub.nesg.org, with supplements at hub.nesg.org/supplement; LinkHub’s database models and code may be downloaded at hub.nesg.org/download.

Conclusion: LinkHub leverages integrated data based on semantic web standards to provide novel retrieval of identifier-related documents through relational graph queries, simplifies and manages connections to major hubs such as UniProt, and provides useful interactive and query interfaces for exploring the integrated data.
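To make the idea of an identifier graph acting as "connecting glue" more concrete, the following is a minimal Python sketch, not LinkHub's actual data model or code. It assumes identifiers can be represented as (database, accession) pairs joined by labelled, undirected relationships, and it answers a cross-database question by breadth-first search. The class name `IdentifierGraph`, the database labels, and all accessions (`Y1`, `U1`, `F1`, `U2`, `W1`) are hypothetical placeholders.

```python
from collections import defaultdict, deque


class IdentifierGraph:
    """Toy graph of identifiers; each node is a (database, accession) pair."""

    def __init__(self):
        # node -> set of (neighbour, relation) pairs
        self._edges = defaultdict(set)

    def link(self, a, b, relation):
        """Record an undirected, labelled relationship between two identifiers."""
        self._edges[a].add((b, relation))
        self._edges[b].add((a, relation))

    def reachable(self, start, target_db, max_hops=3):
        """Return accessions in target_db within max_hops links of start."""
        seen = {start}
        queue = deque([(start, 0)])
        hits = set()
        while queue:
            node, hops = queue.popleft()
            if node[0] == target_db and node != start:
                hits.add(node[1])
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbour, _relation in self._edges[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
        return sorted(hits)


# Hypothetical data: a yeast ORF and a worm gene whose proteins share a family.
g = IdentifierGraph()
g.link(("YeastORF", "Y1"), ("UniProt", "U1"), "synonym")
g.link(("UniProt", "U1"), ("Family", "F1"), "family_member")
g.link(("UniProt", "U2"), ("Family", "F1"), "family_member")
g.link(("UniProt", "U2"), ("WormGene", "W1"), "synonym")

worm_hits = g.reachable(("YeastORF", "Y1"), "WormGene", max_hops=4)
print(worm_hits)
```

In this toy setting the worm gene is found only because the search may pass through the shared family node; with a smaller hop limit the two identifier spaces stay disconnected. A production system would express the same path as a query over RDF or relational tables rather than as an in-memory search.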
