Computing Identity Co-Reference Across Drug Discovery Datasets

This paper presents the rules used within the OpenPHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence, driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to apply in the context of their application. We highlight the challenges of automatically computing co-reference and the need to capture the context of each equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
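
As an illustration of how a captured context might control the co-reference computation, the following is a minimal sketch and not the rules used by the Identity Management Service. It assumes each link between two identifiers carries a justification label, and it merges identifiers into a chain only when the consumer accepts that justification. The function name, the identifiers (`ds1:a` and so on), and the justification labels are all hypothetical.

```python
from collections import defaultdict


def co_reference_chains(links, allowed):
    """Group identifiers into chains, following only links whose
    justification is in `allowed` (the consumer's chosen context).

    `links` is an iterable of (source, target, justification) triples.
    Hypothetical helper for illustration only.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, target, justification in links:
        # Register both identifiers even if the link is not followed.
        root_s, root_t = find(source), find(target)
        if justification in allowed:
            parent[root_s] = root_t

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical links between identifiers from four datasets.
LINKS = [
    ("ds1:a", "ds2:b", "same_structure"),
    ("ds2:b", "ds3:c", "same_parent_compound"),
    ("ds3:c", "ds4:d", "same_structure"),
]

strict = co_reference_chains(LINKS, {"same_structure"})
relaxed = co_reference_chains(LINKS, {"same_structure", "same_parent_compound"})
```

Under the strict context the weaker link is ignored, so the identifiers fall into two separate chains. Under the relaxed context the same links yield a single chain of four identifiers. The point of the sketch is only that the same set of links can produce different chains depending on the notion of equivalence the consumer selects.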
