Application of K-Medoids with Kd-Tree for Software Fault Prediction

Software fault prediction often suffers from the unavailability of fault data, which makes supervised techniques difficult to apply. In such cases, unsupervised approaches such as clustering are helpful. In this paper, the K-Medoids clustering approach is applied to software fault prediction. To overcome the inherent computational complexity of the K-Medoids algorithm, a data structure called the Kd-Tree is used to identify data agents in the datasets. Partitioning Around Medoids (PAM) is applied to these data agents, yielding a set of medoids. Each remaining data point is then assigned to its nearest medoid to obtain the final clusters. Error analysis of the fault predictions shows that, on one of the real datasets, our approach outperforms all the other unsupervised approaches compared and gives the best values for the evaluation parameters. On the other real datasets, our results are comparable to those of other techniques. We also compare the performance of our technique with that of other techniques; the results show that it drastically reduces the total number of distance calculations, because the number of data agents is much smaller than the number of data points.
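The pipeline described in the abstract (Kd-Tree data agents, PAM on the agents, nearest-medoid assignment) can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: each data agent is taken to be the mean of a Kd-Tree leaf bucket, the tree splits at the median of the widest dimension, PAM starts from the first k agents, and all function names and the `leaf_size` parameter are hypothetical.

```python
import math
import random

def kd_leaves(points, leaf_size):
    """Split recursively at the median of the widest dimension; return leaf buckets."""
    if len(points) <= leaf_size:
        return [points]
    dims = range(len(points[0]))
    widest = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2
    return kd_leaves(ordered[:mid], leaf_size) + kd_leaves(ordered[mid:], leaf_size)

def data_agents(points, leaf_size=8):
    """Assumption: one agent per leaf bucket, taken as the bucket mean."""
    return [tuple(sum(c) / len(leaf) for c in zip(*leaf))
            for leaf in kd_leaves(points, leaf_size)]

def total_cost(agents, medoids):
    """Sum of distances from every agent to its nearest medoid."""
    return sum(min(math.dist(a, m) for m in medoids) for a in agents)

def pam(agents, k):
    """Simplified PAM: naive start, then swap medoids while the cost decreases."""
    medoids = list(agents[:k])  # assumption: naive initialisation, not PAM's BUILD step
    cost = total_cost(agents, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in agents:
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(agents, trial)
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
    return medoids

def assign(points, medoids):
    """Label every original data point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: math.dist(p, medoids[j]))
            for p in points]

# Synthetic stand-in for software metric data: two well-separated groups.
rng = random.Random(0)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
points += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(100)]

agents = data_agents(points, leaf_size=8)
medoids = pam(agents, k=2)
labels = assign(points, medoids)
print(len(points), "points reduced to", len(agents), "data agents")
```

In this sketch the swap phase evaluates distances only among the agents, so its cost depends on the number of agents rather than the number of data points; the full dataset is touched once, in the final assignment step.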
