Aspects of Data Modeling and Query Processing for Complex Multidimensional Data

This thesis addresses data modeling and query processing for complex multidimensional data. Multidimensional data has attracted much attention in both academia and industry in recent years, fueled by the popularity of data warehousing and On-Line Analytical Processing (OLAP) applications. Complex multidimensional data is common in medical informatics, an area that may benefit significantly from the functionality offered by data warehousing and OLAP. However, clinical applications place requirements on data warehousing technologies that go beyond those of conventional data warehouse applications. This thesis presents a number of new research challenges that clinical applications pose to the database research community. These include the need for complex-data modeling features, advanced temporal support, advanced classification structures, continuously valued data, dimensionally reduced data, and the integration of complex data.

OLAP systems typically employ multidimensional data models to structure their data. This thesis identifies eleven modeling requirements for multidimensional data models. These requirements are derived from a realistic assessment of complex data found in real-world applications. A survey of twelve multidimensional data models reveals shortcomings in meeting some of the requirements. Existing models do not support many-to-many relationships between facts and dimensions, do not have built-in mechanisms for handling change and time, lack support for imprecision, and do not allow data to be inserted at varying granularities. Additionally, most of the models do not support irregular dimension hierarchies or aggregation semantics. This thesis defines an extended multidimensional data model and an algebraic query language that address all eleven requirements.
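
To make the many-to-many requirement concrete, the following minimal sketch shows how a naive roll-up miscounts when one fact relates to several dimension values. It is an illustration only, not the thesis's data model or algebra: the patient and diagnosis data and the helper `count_by_diagnosis` are hypothetical.

```python
from collections import defaultdict

# Hypothetical clinical facts: each patient (a fact) may carry several
# diagnoses (dimension values), i.e. a many-to-many relationship.
patient_diagnoses = {
    "p1": {"diabetes"},
    "p2": {"diabetes", "hypertension"},
    "p3": {"hypertension"},
}


def count_by_diagnosis(facts):
    """Count patients per diagnosis (hypothetical helper)."""
    counts = defaultdict(int)
    for diagnoses in facts.values():
        for diagnosis in diagnoses:
            counts[diagnosis] += 1
    return dict(counts)


per_diagnosis = count_by_diagnosis(patient_diagnoses)

# Summing the per-diagnosis counts counts patient p2 twice.
naive_total = sum(per_diagnosis.values())

# The correct total counts each patient exactly once.
true_total = len(patient_diagnoses)

print(per_diagnosis, naive_total, true_total)
```

Here the per-diagnosis counts are each correct, but adding them together overstates the number of patients. A model that records the many-to-many relationship and the aggregation semantics of the data can detect that such a sum is not meaningful.
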
The model reuses the common multidimensional concepts of dimension hierarchies and granularities to capture imprecise data. For queries that cannot be answered precisely because of imprecise data, techniques are proposed that take the imprecision into account in the grouping of the data, in the subsequent aggregate computation, and in the presentation of the imprecise result to the user. In addition, alternative queries unaffected by imprecision are offered. The presented data model and query evaluation techniques can be implemented using relational database technology. The approach is also capable of exploiting multidimensional query processing techniques such as pre-aggregation. This yields a practical solution with low computational overhead.

Pre-aggregation, the prior materialization of aggregate queries for later use, is an essential technique for ensuring adequate response time during data analysis. Full pre-aggregation, where all combinations of aggregates are materialized, is infeasible. Instead, modern OLAP systems adopt the practical pre-aggregation approach of materializing only selected combinations of aggregates and then re-using these to compute other aggregates efficiently. However, this re-use of aggregates is contingent on the dimension hierarchies and the relationships between facts and dimensions satisfying stringent constraints. This severely limits the scope of the practical pre-aggregation approach. This thesis significantly extends the scope of practical pre-aggregation to
