Automating the multidimensional design of data warehouses

Previous experience in the data warehouse field shows that the multidimensional schema of a data warehouse must be the result of a hybrid approach; that is, a proposal that considers both the user requirements and the data sources during the design process. As in any other system, requirements are needed to guarantee that the system developed satisfies the user's needs. Moreover, since this is a reengineering process, the data sources must be taken into account in order to: (i) guarantee that the resulting data warehouse can be populated with the organization's data, and (ii) discover analysis capabilities that are not evident to, or not known by, the user.

Several methods to support the data warehouse modeling process have been presented in the literature. However, the proposals based on requirements analysis assume that the requirements are exhaustive, and do not consider that relevant information may be hidden in the data sources. Conversely, the proposals based on an exhaustive analysis of the data sources take that approach to its extreme: they propose all the multidimensional knowledge that can be derived from the data sources and, consequently, generate too many results.

In this scenario, automating the design of the data warehouse is essential to prevent the whole burden of the task from falling on the designer (in this way, we do not have to rely solely on the designer's skill and knowledge to apply the chosen design method). Furthermore, automation frees the designer from the always complex and costly analysis of the data sources, which may become unfeasible for large data sources. Nowadays, the automatable methods analyze the data sources in detail and overlook the requirements. In contrast, the methods based on requirements analysis do not consider automating the process, since they work with requirements expressed in high-level languages that a computer cannot handle. The same situation occurs in current hybrid methods, which propose a sequential approach in which the analysis of the data is complemented with the analysis of the requirements: both tasks suffer from the same problems as the pure approaches.

In this thesis we propose two methods to support the data warehouse modeling task: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both consider the requirements and the data sources to carry out the modeling task, and both were conceived to overcome the limitations of current approaches.

1. MDBE follows a classical approach, in which the user requirements are known beforehand. This method benefits from the knowledge captured in the data sources, but guides the process from the requirements and, consequently, is able to work over semantically poor data sources. That is, it exploits the fact that, with high-quality requirements, we can overcome the drawbacks of having data sources that do not properly capture our domain.

2. Unlike MDBE, AMDO assumes a scenario in which semantically rich data sources are available. For this reason, it drives the modeling process from the data sources, and uses the requirements to shape the generated results and adapt them to the user's needs. In this context, unlike the previous one, semantically rich data sources compensate for not having clear user requirements beforehand.

Note that our methods establish a combined framework that can be used to decide, given a specific scenario, which approach is most suitable. For example, the same approach cannot be followed in a scenario where the requirements are well known beforehand and in a scenario where they are not yet clear (a recurring case of the latter is when users are not aware of the analysis capabilities of their own system). Indeed, having good requirements beforehand lessens the need for semantically rich data sources, whereas, conversely, if the data sources adequately capture our domain, the requirements are not needed beforehand. For these reasons, in this thesis we provide a combined framework that covers all the possible scenarios that may be encountered during the data warehouse modeling task.
