On the Role of Statistical Significance in Exploratory Data Analysis

Recently, an approach to knowledge discovery called Attribute Focusing has been used by software development teams to discover, from categorical defect data, knowledge that allows them to improve their software development process in real time. This feedback is provided by computing the difference between the observed proportion within a selected category and an expected proportion for that category, and then studying those differences to identify their causes in the context of the process and product. In this paper, we consider the possibility that some differences may simply have occurred by chance, i.e., as a consequence of some random effect in the process generating the data. We develop an approach based on statistical significance to identify such differences. We present preliminary empirical results which indicate that knowledge of statistical significance should be used carefully when selecting differences to be studied to identify causes. Conventional wisdom would suggest eliminating from consideration every difference that is not statistically significant at some small level. Our results show that such elimination is not a good idea. They also show that information on statistical significance can be useful in the process of identifying a cause.
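The computation described above can be illustrated with a minimal sketch. It is not the procedure developed in the paper, which this abstract does not specify. It assumes that the expected proportion is supplied by the analyst and that chance variation is modeled with an exact two-sided binomial test. The function names and the example counts are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n trials with hit probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_p(k, n, p):
    """Exact two-sided p-value (assumed test, not taken from the paper).

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under the expected proportion p.
    """
    observed = binomial_pmf(k, n, p)
    # Small relative tolerance so outcomes that tie with the observed
    # probability are not lost to floating-point rounding.
    threshold = observed * (1 + 1e-9)
    total = sum(
        prob
        for prob in (binomial_pmf(i, n, p) for i in range(n + 1))
        if prob <= threshold
    )
    return min(1.0, total)


def describe_difference(count_in_category, total_count, expected_proportion):
    """Report the observed-minus-expected difference together with its p-value.

    The p-value is returned as extra information for the analyst;
    nothing is filtered out on the basis of it.
    """
    observed_proportion = count_in_category / total_count
    return {
        "observed": observed_proportion,
        "expected": expected_proportion,
        "difference": observed_proportion - expected_proportion,
        "p_value": two_sided_binomial_p(
            count_in_category, total_count, expected_proportion
        ),
    }


if __name__ == "__main__":
    # Hypothetical data: 30 of 100 defects fall in one category,
    # where 25% were expected.
    print(describe_difference(30, 100, 0.25))
```

The sketch attaches the p-value to each difference rather than discarding differences that fall short of a threshold, in keeping with the finding that such elimination is not a good idea.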
