Continuous improvements in connectivity, storage and data processing capabilities have opened access to a deluge of data from sensors, social media, news, user-generated content, and government and private data sources. Accordingly, in a modern data-oriented landscape, with the advent of various data capture and management technologies, organizations are rapidly shifting to the datafication of their processes. In such an environment, analysts may need to deal with a collection of datasets, from relational to NoSQL, that holds a vast amount of data gathered from various private and open data islands, i.e., a data lake. Organizing, indexing and querying the growing volume of internal data and metadata in a data lake is challenging and requires varied skills and experience to deal with dozens of new database and indexing technologies: How to store information items? What technology to use for persisting the data? How to deal with the large volume of streaming data? How to trace and persist information about data? What technology to use for indexing the data? How to query the data lake? To address these challenges, we present CoreDB, an open source data lake service that offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies and offers a built-in design for security and tracing.