Mapping heterogeneous research infrastructure metadata into a unified catalogue for use in a generic virtual research environment

Abstract: Virtual Research Environments (VREs), also known as science gateways or virtual laboratories, assist researchers in data science by integrating tools for data discovery, data retrieval, workflow management and researcher collaboration, often coupled with a specific computing infrastructure. Recently, the push for better open data science has led to the creation of a variety of dedicated research infrastructures (RIs) that gather data and provide services to different research communities, all of which can be used independently of any specific VRE. There is therefore a need for generic VREs that can be coupled with the resources of many different RIs simultaneously and easily customised to the needs of specific communities. However, the resource metadata produced by these RIs rarely adhere to a single standard or vocabulary, which makes it difficult to search for and discover resources independently of their providers without some translation into a common framework. Cross-RI search can be expedited by using mapping services that harvest RI-published metadata to build unified resource catalogues, but the development and operation of such services pose a number of challenges. In this paper, we discuss some of these challenges and look specifically at the VRE4EIC Metadata Portal, which uses X3ML mappings to build a single catalogue describing data products and other resources provided by multiple RIs. The Metadata Portal was built in accordance with the e-VRE Reference Architecture, a microservice-based architecture for generic modular VREs, and uses the CERIF standard to structure its catalogued metadata. We consider the extent to which it addresses the challenges of cross-RI search, particularly in the environmental and earth science domain, and how it can be further augmented, for example by taking advantage of linked vocabularies to provide more intelligent semantic search across multiple domains of discourse.
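The general idea of harvesting provider-specific metadata and translating it into one searchable catalogue can be illustrated with a minimal Python sketch. This is not the Metadata Portal's implementation and uses neither X3ML nor CERIF: the provider names (`ri_a`, `ri_b`), the field names, the `CatalogueEntry` shape and the sample records are all hypothetical, chosen only to show how per-provider mappings allow a single query to span several RIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """Common record shape for the unified catalogue (hypothetical)."""
    source: str
    identifier: str
    title: str
    keywords: tuple


# Hypothetical per-RI mappings: common field name -> provider field name.
FIELD_MAPPINGS = {
    "ri_a": {"identifier": "id", "title": "name", "keywords": "tags"},
    "ri_b": {"identifier": "doi", "title": "dc_title", "keywords": "subjects"},
}


def map_record(source, record):
    """Translate one provider record into the common catalogue shape."""
    mapping = FIELD_MAPPINGS[source]
    return CatalogueEntry(
        source=source,
        identifier=str(record[mapping["identifier"]]),
        title=str(record[mapping["title"]]),
        # Normalise keyword case so that search does not depend on the provider.
        keywords=tuple(k.lower() for k in record.get(mapping["keywords"], [])),
    )


def build_catalogue(harvested):
    """Map every harvested record from every provider into one list."""
    return [
        map_record(source, record)
        for source, records in harvested.items()
        for record in records
    ]


def search(catalogue, keyword):
    """Return all entries tagged with the keyword, regardless of provider."""
    keyword = keyword.lower()
    return [entry for entry in catalogue if keyword in entry.keywords]


# Made-up sample metadata standing in for harvested RI records.
HARVESTED = {
    "ri_a": [
        {"id": "a-1", "name": "Ocean temperature profiles",
         "tags": ["Ocean", "Temperature"]},
    ],
    "ri_b": [
        {"doi": "10.0000/example", "dc_title": "Sea surface temperature grid",
         "subjects": ["temperature"]},
        {"doi": "10.0000/example-2", "dc_title": "Seismic event list",
         "subjects": ["seismology"]},
    ],
}

catalogue = build_catalogue(HARVESTED)
hits = search(catalogue, "temperature")
```

Here `hits` contains one entry from each hypothetical provider, even though the two providers name their identifier, title and keyword fields differently. A real mapping service would also have to deal with structural and semantic mismatches that a flat field rename cannot express.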
