Enabling ad-hoc reuse of private data repositories through schema extraction

Background: Sharing sensitive data across organizational boundaries is often severely limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements on the protection of personal and privacy-sensitive data. New approaches, such as the Personal Health Train initiative, are therefore emerging that use data directly in their original repositories, avoiding the need to transfer them.

Results: To address limitations of previous systems, this paper proposes a configurable and automated schema extraction and publishing approach that enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data. The approach is compatible with existing Semantic Web technologies and allows such queries to be executed subsequently in a safe setting under the data provider’s control. An evaluation with four distinct datasets shows that the approach derives a configurable amount of concise, task-relevant schema that closely describes the structure of the underlying data and enables schema introspection-assisted authoring of SPARQL queries.

Conclusions: Automatically extracting and publishing data schema can enable the introspection-assisted creation of data selection and integration queries. In conjunction with the presented system architecture, this approach can enable reuse of data from private repositories, including in settings where agreeing on a shared schema and encoding a priori is infeasible. As such, it could be an important step towards the reuse of data from previously inaccessible sources and thus towards the proliferation of data-driven methods in the biomedical domain.
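To make the idea of schema extraction concrete, the following is a minimal sketch and not the extraction procedure of the paper. It assumes triples are available as plain Python tuples with a flag marking literal objects, and it summarizes them into (subject class, property, object class) patterns. The function name `extract_schema`, the `ex:` identifiers, and the example data are all hypothetical. Only class and property names appear in the result, so a query author could inspect the structure without seeing instance identifiers or literal values.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability; an assumption of this sketch


def extract_schema(triples):
    """Summarize triples into (subject class, property, object class) patterns.

    `triples` is an iterable of (subject, predicate, object, object_is_literal).
    Literal objects are reported only as the placeholder "Literal".
    """
    triples = list(triples)

    # First pass: record the declared classes of every typed resource.
    types = defaultdict(set)
    for s, p, o, is_literal in triples:
        if p == RDF_TYPE and not is_literal:
            types[s].add(o)

    # Second pass: lift each instance-level triple to the class level.
    schema = set()
    for s, p, o, is_literal in triples:
        if p == RDF_TYPE:
            continue
        for s_cls in types.get(s, ()):
            if is_literal:
                schema.add((s_cls, p, "Literal"))
            else:
                for o_cls in types.get(o, ()):
                    schema.add((s_cls, p, o_cls))
    return sorted(schema)


# Hypothetical private data held by the provider.
data = [
    ("ex:p1", RDF_TYPE, "ex:Patient", False),
    ("ex:p1", "ex:hasAge", "54", True),
    ("ex:p1", "ex:hasDiagnosis", "ex:d1", False),
    ("ex:d1", RDF_TYPE, "ex:Diagnosis", False),
    ("ex:d1", "ex:code", "C34.1", True),
]

schema = extract_schema(data)
for pattern in schema:
    print(pattern)
```

With such a summary published, a SPARQL query selecting, for example, `ex:Patient` resources and their `ex:hasDiagnosis` values could be written against the schema and then submitted for execution on the provider's side. A real deployment would additionally need the configurability and safeguards the abstract mentions, which this sketch does not attempt.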
