Ontology-Driven Conceptual Design of ETL Processes Using Graph Transformations

One of the main tasks in the early stages of a data warehouse project is to identify the appropriate transformations and to specify the inter-schema mappings from the source to the target data stores. This is a challenging task, requiring first the semantic and then the structural reconciliation of the information provided by the available sources. It is part of the Extract-Transform-Load (ETL) process, which is responsible for populating the data warehouse. In this paper, we propose a customizable and extensible ontology-driven approach for the conceptual design of ETL processes. A graph-based representation is used as a conceptual model for the source and target data stores. We then present a method for devising flows of ETL operations by means of graph transformations. In particular, the operations comprising the ETL process are derived through graph transformation rules, whose choice and applicability are determined by the semantics of the data with respect to an attached domain ontology. Finally, we present experimental findings that demonstrate the applicability of our approach.
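
To make the idea of ontology-driven rule selection more concrete, the following is a minimal sketch, not the method of the paper itself. It assumes a toy ontology given as a parent map, data store elements annotated with a single concept, and three illustrative rules; the names `PARENT`, `Node`, `choose_operation`, and `apply_rule`, as well as the `LOAD`, `FILTER`, and `CONVERT` labels and their applicability conditions, are hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: each concept maps to its parent (None for a root).
PARENT = {"Price": None, "EuroPrice": "Price", "DollarPrice": "Price"}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT.get(concept)
    return chain


@dataclass(frozen=True)
class Node:
    name: str     # element of a data store, e.g. an attribute
    concept: str  # ontology concept annotating that element (assumed given)


def choose_operation(src, tgt):
    """Pick an illustrative ETL operation from the concepts of two nodes."""
    if src.concept == tgt.concept:
        return "LOAD"      # same semantics: copy the data as-is
    if src.concept in ancestors(tgt.concept):
        return "FILTER"    # target is more specific than the source
    src_parent = PARENT.get(src.concept)
    if src_parent is not None and src_parent == PARENT.get(tgt.concept):
        return "CONVERT"   # sibling concepts: change the representation
    return None            # no rule applies


def apply_rule(edges, src, tgt):
    """Rewrite the graph by inserting an operation node between src and tgt.

    The graph is a set of (from, to) edges; the input set is not modified.
    """
    op = choose_operation(src, tgt)
    if op is None:
        return set(edges)
    op_node = f"{op}({src.name}->{tgt.name})"
    return set(edges) | {(src.name, op_node), (op_node, tgt.name)}


# Usage: a source attribute in dollars feeding a target attribute in euros.
source = Node("S.price_usd", "DollarPrice")
target = Node("T.price_eur", "EuroPrice")
flow = apply_rule(set(), source, target)
```

In this sketch each rule application adds one operation node and two edges, so a flow of ETL operations grows out of repeated rewrites of the graph. A realistic rule set would need pattern matching over larger subgraphs and richer ontology relations than a single parent link.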
