An empirical meta-analysis of the life sciences linked open data on the web

While the biomedical community has published several “open data” sources in the last decade, most researchers still face severe logistical and technical challenges in discovering, querying, and integrating heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 biomedical linked open data sources into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and contain elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
