SETL: A programmable semantic extract-transform-load framework for semantic data warehouses

To make better decisions in business analytics, organizations increasingly use external structured, semi-structured, and unstructured data in addition to their (mostly structured) internal data. Current Extract-Transform-Load (ETL) tools are not suitable for this “open world scenario” because they do not consider semantic issues in the integration process: they neither support processing semantic data nor create a semantic Data Warehouse (DW), a repository of semantically integrated data. This paper describes our programmable Semantic ETL (SETL) framework. SETL builds on Semantic Web (SW) standards and tools and supports developers by offering a number of powerful modules, classes, and methods for (dimensional and semantic) DW constructs and tasks. It thus supports semantic data sources in addition to traditional data sources, semantic integration, and creating or publishing a semantic (multidimensional) DW in the form of a knowledge base. A comprehensive experimental evaluation on a concrete use case, comparing SETL to a solution built with traditional tools (which requires much more hand-coding), shows that SETL provides better programmer productivity, knowledge base quality, and performance.
