Towards a Programmable Semantic Extract-Transform-Load Framework for Semantic Data Warehouses

To make better decisions in business analytics, organizations increasingly use external data (structured, semi-structured, and unstructured) in addition to their mostly structured internal data. Current Extract-Transform-Load (ETL) tools are not suitable for this "open world scenario" because they do not consider semantic issues in the integration process. Nor do they support processing semantic-aware data or creating a Semantic Data Warehouse (DW) as a semantic repository of semantically integrated data. This paper describes SETL, a Python-based programmable Semantic ETL framework. SETL builds on Semantic Web (SW) standards and tools and supports developers by offering a number of powerful modules, classes, and methods for (dimensional and semantic) DW constructs and tasks. It thus supports semantic-aware data sources, semantic integration, and the creation of a semantic DW composed of an ontology and its instances. A comprehensive experimental evaluation on a concrete use case, comparing SETL to a solution built with traditional tools (which requires much more hand-coding), shows that SETL provides better performance, knowledge base quality, and programmer productivity.
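
To make the idea of a semantic DW "composed of an ontology and its instances" more concrete, the following is a minimal sketch of one semantic ETL step: tabular rows are extracted, and each row is transformed into RDF instance triples serialized as N-Triples. It does not use or reproduce SETL's actual API, which the abstract does not describe. The namespace `http://example.org/dw/`, the sample data, and the helper names `iri`, `literal`, and `rows_to_ntriples` are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/dw/"  # hypothetical namespace, not from the paper


def iri(*parts):
    """Build an IRI under BASE, percent-encoding each path segment."""
    return "<" + BASE + "/".join(quote(str(p), safe="") for p in parts) + ">"


def literal(value):
    """Serialize a value as a plain N-Triples string literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_ntriples(rows, class_name, key_column):
    """Turn each row into one typed instance plus one triple per non-empty column."""
    triples = []
    for row in rows:
        subject = iri(class_name, row[key_column])
        triples.append(f"{subject} <{RDF_TYPE}> {iri(class_name)} .")
        for column, value in row.items():
            if column == key_column or value == "":
                continue  # the key is already in the IRI; skip missing values
            triples.append(f"{subject} {iri(column)} {literal(value)} .")
    return triples


# Extract: read a small in-memory CSV (illustrative sample data).
SAMPLE_CSV = "id,name,city\n1,Acme Farm,Aalborg\n2,Beta Dairy,\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Transform: map rows to instance triples of a hypothetical class "Company".
triples = rows_to_ntriples(rows, "Company", "id")

# Load: here the triples are only printed; a real pipeline would write them to a triple store.
for triple in triples:
    print(triple)
```

A real framework would additionally align column names with an ontology's properties, assign datatypes, and link instances to external resources; this sketch only shows the basic row-to-triple mapping.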
