ETL Processes in the Era of Variety

We live in an open and connected world in which small, medium, and large companies seek to integrate data from various sources to satisfy the requirements of new applications, such as delivering real-time alerts, triggering automated actions, detecting complex system failures, and detecting anomalies. The process of getting these data from their sources to their home system in an efficient and correct manner is known as data ingestion, usually referred to as Extract, Transform, Load (ETL), which has been widely studied in data warehousing. In the context of rapidly changing technology and the explosion of data sources, ETL processes have to consider two main issues: (a) the variety of data sources, which spans traditional, XML, semantic, and graph databases, among others, and (b) the variety of storage platforms, where the home system may have several stores (known as a polystore), each hosting a particular type of data. These issues directly impact the efficiency and the deployment flexibility of ETL. In this paper, we deal with these issues. First, thanks to Model Driven Engineering, we make the different types of data sources generic. This genericity allows the ETL operators to be overloaded for each type of source. We illustrate it by considering three of the most popular types of data sources: relational, semantic, and graph databases. Second, we show the impact of the genericity of operators on the ETL workflow, where a Web-service-driven approach for orchestrating the ETL flows is given. Third, the extracted and merged data obtained by the ETL workflow are deployed according to their favorite stores. Finally, our findings are validated through a proof-of-concept tool using the LUBM semantic database and the Yago graph deployed in Oracle RDF Semantic Graph 12c.
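To make the idea of overloading an ETL operator per source type concrete, the following is a minimal sketch in Python, not the implementation described in the paper. It assumes a single hypothetical `extract` operator and three hypothetical source classes (`RelationalSource`, `SemanticSource`, `GraphSource`); the generic record format, a dictionary with an `id` key, is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical source wrappers; the names and fields are illustrative only.
@dataclass
class RelationalSource:
    table: str
    rows: list      # each row is a column -> value mapping


@dataclass
class SemanticSource:
    triples: list   # RDF-style (subject, predicate, object) tuples


@dataclass
class GraphSource:
    nodes: dict     # node id -> property mapping


@singledispatch
def extract(source):
    """Generic extract operator; one overload is registered per source type."""
    raise TypeError(f"no extract operator for {type(source).__name__}")


@extract.register
def _(source: RelationalSource):
    # One generic record per row, identified by table name and row position.
    return [{"id": f"{source.table}/{i}", **row}
            for i, row in enumerate(source.rows)]


@extract.register
def _(source: SemanticSource):
    # Group triples by subject so that each subject becomes one record.
    records = {}
    for subject, predicate, obj in source.triples:
        records.setdefault(subject, {"id": subject})[predicate] = obj
    return list(records.values())


@extract.register
def _(source: GraphSource):
    # One generic record per node, carrying the node's properties.
    return [{"id": node_id, **props} for node_id, props in source.nodes.items()]
```

A workflow step can then call `extract(source)` without knowing which kind of source it was given; supporting a further source type under this sketch would mean registering one more overload rather than changing the workflow.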

[38]  Patrick Valduriez,et al.  CloudMdsQL: querying heterogeneous cloud data stores with a common language , 2016, Distributed and Parallel Databases.