Empirical studies to assess the understandability of data warehouse schemas using structural metrics

Data warehouses are powerful tools for making better and faster decisions in organizations where information is an asset of primary importance. Because data warehouses are complex, metrics and procedures are required to continuously assure their quality. This article describes an empirical study and a replication that investigate the use of structural metrics as indicators of the understandability and, by extension, the cognitive complexity of data warehouse schemas. More specifically, a four-step analysis is conducted: (1) check whether the considered metrics, individually and collectively, can be correlated with schema understandability using classical statistical techniques, (2) evaluate whether understandability can be predicted by case similarity using the case-based reasoning technique, (3) determine, for each level of understandability, the subsets of metrics that are important by means of a classification technique, and (4) assess, by means of a probabilistic technique, the degree of participation of each metric in the understandability prediction. The results show that although a linear model is a good approximation of the relation between structure and understandability, the associated coefficients are not significant enough. Additionally, the remaining three analyses reveal, respectively, that prediction can be achieved by considering structure similarity, that the extracted classification rules can be used to estimate the magnitude of understandability, and that some metrics, such as the number of fact tables, have more impact than others.
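
As a rough illustration of the first two steps, the sketch below computes a Pearson correlation between one structural metric and an understandability measure, then predicts understandability for a new schema from its most similar known cases. It is a minimal sketch and not the procedure used in the study: the metric vectors (fact tables, dimension tables, foreign keys), the understanding times, the Euclidean distance, and the function names are all assumptions made up for illustration.

```python
from math import sqrt

# HYPOTHETICAL cases, invented for illustration only (not data from the study).
# Each case: ((fact tables, dimension tables, foreign keys), understanding time in seconds)
CASES = [
    ((1, 3, 3), 60.0),
    ((1, 5, 5), 80.0),
    ((2, 6, 8), 140.0),
    ((3, 8, 12), 200.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def euclidean(a, b):
    """Euclidean distance between two metric vectors (an assumed similarity measure)."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def predict_by_similarity(cases, query, k=2):
    """Average the understanding time of the k cases closest to the query schema."""
    nearest = sorted(cases, key=lambda case: euclidean(case[0], query))[:k]
    return sum(time for _, time in nearest) / len(nearest)


# Step 1 (sketch): correlate the number of fact tables with understanding time.
fact_tables = [metrics[0] for metrics, _ in CASES]
times = [time for _, time in CASES]
r = pearson(fact_tables, times)

# Step 2 (sketch): predict understanding time for an unseen schema by case similarity.
predicted = predict_by_similarity(CASES, query=(2, 5, 7), k=2)

print(f"correlation (fact tables vs. time): {r:.3f}")
print(f"predicted understanding time: {predicted:.1f} s")
```

With real data, the metrics would usually be normalized before computing distances so that no single metric dominates the similarity.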
