Extracting classification rule of software diagnosis using modified MEPA

Defective software modules cause software failures, increase development and maintenance costs, and reduce customer satisfaction. Effective defect prediction models help developers focus quality assurance activities on defect-prone modules and thus improve software quality by using resources more efficiently. Real-world databases are highly susceptible to noisy, missing, and inconsistent data. Noise is a random error or variance in a measured variable [Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques, San Francisco: Morgan Kaufmann Publishers]. When decision trees are built, many of the branches may reflect noisy or outlier data, so data preprocessing is very important. Among the many preprocessing methods, concept hierarchies are a form of data discretization that can be used for this purpose. Data discretization has several advantages: it reduces and simplifies the data, and results obtained with discrete features are usually more compact, shorter, and more accurate than those obtained with continuous ones [Liu, H., Hussain, F., Tan, C.L., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393-423]. In this paper, we propose a modified minimize entropy principle approach (MEPA) and develop a modified MEPA system that partitions the data and then builds a classification tree model. For verification, two NASA software projects, KC2 and JM1, are used to illustrate the proposed method, and we establish a prototype system to discretize the data from these projects. In terms of both error rate and number of rules, the results show that the proposed approach performs better than other methods.
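To make the idea of entropy-based partitioning concrete, the following is a minimal Python sketch of a generic minimum-entropy binary split on one continuous attribute. It is not the modified MEPA described in the paper, whose details the abstract does not give; the function names `entropy` and `best_threshold` and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) for the binary split of a
    continuous attribute that minimizes the weighted class entropy.

    Hypothetical helper: a plain minimum-entropy cut, not the paper's
    modified MEPA. Returns (None, inf) if no split is possible.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        # A cut is only possible between two distinct attribute values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data invented for illustration (not taken from KC2 or JM1):
# a single metric value per module and a defect label.
metric = [1, 2, 3, 10, 11, 12]
label = ["ok", "ok", "ok", "defect", "defect", "defect"]
cut, score = best_threshold(metric, label)
print(cut, score)
```

On this toy data the two classes are perfectly separable, so the chosen cut point is 6.5, midway between 3 and 10. Applying such a split recursively to each resulting interval would yield a multi-interval discretization that can then be fed to a decision tree learner.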
