Representation of conceptual ETL designs in natural language using Semantic Web technology

Extract-Transform-Load (ETL) processes constitute the back stage of Data Warehouse architectures. Several studies characterize ETL design as a time-consuming and error-prone procedure. A critical phase in the ETL lifecycle involves the early communication and design steps that aim at producing a conceptual ETL design. Various research approaches have dealt with the conceptual modeling of ETL processes, but all share two drawbacks: they require intensive human effort from the designers to create the models, and technical knowledge from the business people to understand them. In this paper, we focus on the second aspect and provide a method for representing a conceptual ETL design as a narrative, which is the most natural means of communication and does not require particular technical skills or familiarity with any specific model. Specifically, this work builds upon previously proposed techniques that automate the conceptual design by leveraging Semantic Web technology. The key idea is to map the data stores involved, both source and target, to a domain ontology and then use a reasoner to produce the ETL design. We discuss how linguistic techniques can be used to establish a common application vocabulary. We present a flexible and customizable template-based mechanism for representing the ETL design as a narrative. Finally, we discuss issues related to the production of meaningful reports and provide implementation details.
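
To make the idea of template-based narration concrete, the following is a minimal sketch and not the mechanism described in the paper. It assumes that a conceptual ETL design is available as a list of operations with named parameters; the template texts, the operation types, the `narrate` function, and the sample data stores are all hypothetical and chosen only for illustration.

```python
from string import Template

# Hypothetical templates, one per ETL operation type (not taken from the paper).
TEMPLATES = {
    "extract": Template("Extract $concept from the source $store."),
    "convert": Template("Convert $attribute from $source_format to $target_format."),
    "filter": Template("Keep only the records where $condition."),
    "load": Template("Load $concept into the target $store."),
}


def narrate(operations):
    """Render a list of (operation_type, parameters) pairs as a narrative."""
    sentences = []
    for op_type, params in operations:
        template = TEMPLATES.get(op_type)
        if template is None:
            # Generic fallback for operation types that have no template.
            sentences.append(f"Apply the operation '{op_type}'.")
        else:
            sentences.append(template.substitute(params))
    return " ".join(sentences)


# Hypothetical design: the stores and attributes are made up for the example.
design = [
    ("extract", {"concept": "orders", "store": "SalesDB"}),
    ("convert", {"attribute": "price", "source_format": "USD", "target_format": "EUR"}),
    ("filter", {"condition": "the order date is after 2005"}),
    ("load", {"concept": "orders", "store": "warehouse"}),
]

print(narrate(design))
```

Keeping the templates in a plain dictionary means that the wording can be customized, for instance per audience or per language, without touching the rendering logic.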
