Graph-based ETL Processes for Warehousing Statistical Open Data

Warehousing is a promising means to cross-reference and analyse Statistical Open Data (SOD). However, extracting structures, integrating data, and defining a multidimensional schema from several scattered and heterogeneous SOD tables are major problems that challenge traditional ETL (Extract-Transform-Load) processes. In this paper, we present a three-step ETL process that relies on RDF graphs to address these problems. In the first step, we automatically extract table structures and values using a table anatomy ontology. This step converts structurally heterogeneous tables into a unified RDF graph representation. The second step performs a holistic integration of several semantically heterogeneous RDF graphs. The optimal integration is computed by solving an Integer Linear Program (ILP). In the third step, the system interacts with users to incrementally transform the integrated RDF graph into a multidimensional schema.
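As a rough illustration of the first two steps, the sketch below flattens a small table into subject-predicate-object triples and then searches for the best one-to-one matching between the column labels of two tables. It is a toy under stated assumptions, not the method of the paper: the predicate names (`hasLabel`, `inColumn`, `hasValue`), the token-overlap (Jaccard) similarity, and the exhaustive search that stands in for the ILP solver are all hypothetical choices made for this example.

```python
from itertools import permutations


def table_to_triples(table_id, header, rows):
    """Flatten a table into (subject, predicate, object) triples.

    The predicate names are hypothetical; they are not taken from the
    table anatomy ontology mentioned in the abstract.
    """
    triples = []
    for c, label in enumerate(header):
        triples.append((f"{table_id}/col{c}", "hasLabel", label))
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            cell = f"{table_id}/cell{r}_{c}"
            triples.append((cell, "inColumn", f"{table_id}/col{c}"))
            triples.append((cell, "hasValue", value))
    return triples


def token_overlap(a, b):
    """Jaccard overlap of the lower-cased word sets of two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_matching(labels_a, labels_b):
    """Exhaustively find the one-to-one matching with maximal total similarity.

    Assumes len(labels_a) <= len(labels_b). This brute-force search is a
    stand-in for an ILP solver and only scales to a handful of labels.
    """
    best_score, best_pairs = -1.0, []
    for perm in permutations(labels_b, len(labels_a)):
        score = sum(token_overlap(a, b) for a, b in zip(labels_a, perm))
        if score > best_score:
            best_score, best_pairs = score, list(zip(labels_a, perm))
    return best_score, best_pairs


if __name__ == "__main__":
    triples = table_to_triples(
        "t1", ["Region", "Year", "Population"], [["North", "2012", "1500"]]
    )
    for triple in triples:
        print(triple)
    print(best_matching(
        ["Region", "Total population"],
        ["Population total", "Year", "Region name"],
    ))
```

The exhaustive search makes the objective explicit (maximise the summed similarity subject to each label being matched at most once), which is the kind of constraint an ILP formulation would encode with binary decision variables.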
