High-Level ETL for Semantic Data Warehouses - Full Version

The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth places new requirements on Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. Traditional Extract-Transform-Load (ETL) tools do not support the incorporation of semantic data into a Data Warehouse (DW) because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level, which relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and we compare it with the previous programmable framework SETLPROG in terms of productivity, development time, and performance. The evaluation shows that 1) SETLCONSTRUCT requires 92% fewer typed characters (Number of Typed Characters, NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT that generates the ETL execution flow automatically) reduces the Number of Used Concepts (NOUC) by a further 25%; 2) with SETLCONSTRUCT, development time is cut almost in half compared to SETLPROG, and it is cut by another 27% with SETLAUTO; 3) SETLCONSTRUCT is scalable and performs similarly to SETLPROG.
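The idea of driving an ETL flow from schema-level metadata, rather than hand-writing a mapping inside each operation, can be illustrated with a minimal sketch. This is not SETLCONSTRUCT's API: the mapping table, the `ex:` prefixed names, and the function `records_to_triples` are all hypothetical and chosen only for illustration. The sketch assumes that the source data arrives as plain Python dictionaries and that triples can be represented as string tuples.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# A triple is modeled as (subject, predicate, object) strings for simplicity.
Triple = Tuple[str, str, str]

# Hypothetical schema-level mapping: source column -> target RDF property.
# It is declared once and reused for every record.
MAPPING: Dict[str, str] = {
    "name": "ex:companyName",
    "city": "ex:locatedIn",
}


def records_to_triples(
    records: Iterable[Dict[str, Optional[object]]],
    key: str,
    mapping: Dict[str, str],
    base: str = "ex:company/",
) -> List[Triple]:
    """Generate triples for each record by following the schema-level mapping."""
    triples: List[Triple] = []
    for record in records:
        # The subject identifier is built from the record's key column.
        subject = f"{base}{record[key]}"
        for column, prop in mapping.items():
            value = record.get(column)
            # Columns that are absent or empty produce no triple.
            if value is not None:
                triples.append((subject, prop, str(value)))
    return triples


if __name__ == "__main__":
    source = [
        {"id": 1, "name": "Acme", "city": "Aalborg"},
        {"id": 2, "name": "Beta"},
    ]
    for triple in records_to_triples(source, "id", MAPPING):
        print(triple)
```

Because the transformation reads the mapping instead of encoding it, adding a new source column only requires a new mapping entry, which is the kind of developer effort the abstract describes as being moved from the operation level to the schema level.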
