Thresholds based outlier detection approach for mining class outliers: An empirical case study on software measurement datasets

Predicting the fault-proneness labels of software program modules is an emerging software quality assurance activity and the quality of datasets collected from previous software version affects the performance of fault prediction models. In this paper, we propose an outlier detection approach using metrics thresholds and class labels to identify class outliers. We evaluate our approach on public NASA datasets from PROMISE repository. Experiments reveal that this novel outlier detection method improves the performance of robust software fault prediction models based on Naive Bayes and Random Forests machine learning algorithms.

[1]  Taghi M. Khoshgoftaar,et al.  Noise identification with the k-means algorithm , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[2]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007 .

[3]  Victoria J. Hodge,et al.  A Survey of Outlier Detection Methodologies , 2004, Artificial Intelligence Review.

[4]  John C. Munson,et al.  Software Engineering Measurement , 2003 .

[5]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[6]  Zengyou He,et al.  Discovering cluster-based local outliers , 2003, Pattern Recognit. Lett..

[7]  Shian-Shyong Tseng,et al.  Two-phase clustering process for outliers detection , 2001, Pattern Recognit. Lett..

[8]  Maurice H. Halstead,et al.  Elements of software science , 1977 .

[9]  Banu Diri,et al.  Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem , 2009, Inf. Sci..

[10]  Claes Wohlin,et al.  Experimentation in software engineering: an introduction , 2000 .

[11]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[12]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD 2000.

[13]  Banu Diri,et al.  A systematic review of software fault prediction studies , 2009, Expert Syst. Appl..

[14]  Taghi M. Khoshgoftaar,et al.  Software quality estimation with limited fault data: a semi-supervised learning perspective , 2007, Software Quality Journal.

[15]  Taghi M. Khoshgoftaar,et al.  Analyzing software measurement data with clustering techniques , 2004, IEEE Intelligent Systems.

[16]  Irena Koprinska,et al.  Learning to classify e-mail , 2007, Inf. Sci..

[17]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.