Streaming ETL in Polystore Era

In today’s digital environment, businesses must access, store, and analyze in real time vast amounts of data produced by streaming graph-structured data sources. To meet these requirements, companies that rely on data warehouse (\(\mathcal {DW}\)) technology have to combine hardware and software solutions to reduce the latency between a \(\mathcal {DW}\) and its data sources. The proliferation of advanced hardware deployment platforms such as polystores represents an opportunity, as pointed out in recent studies. However, deploying a graph-structured \(\mathcal {DW}\) over a polystore is not a simple task, since it requires two important phases: data partitioning and data allocation. We claim that these phases have to be connected to the ETL (Extract, Transform, Load) phase, especially its loading process. This connection calls into question the traditional ordering of the ETL and deployment processes. In this paper, we present a new approach that connects the ETL and deployment processes and revisits their traditional scheduling to meet real-time analysis requirements.
