Using Association Rules to Identify Similarities between Software Datasets

A number of verification and validation (V&V) datasets are publicly available. These datasets contain software measurements and defectiveness information for software modules. To facilitate V&V, numerous defect prediction studies have used these datasets and have detected defective modules effectively. Software developers and managers can draw on these studies to avoid analogous defects and mistakes, provided they can establish a similarity between their own software and the software represented by the public datasets. This paper identifies similar datasets by comparing the association patterns in them. The proposed approach mines association rules from each dataset and identifies the rules that overlap between the 100 strongest rules of each of the two datasets being compared. The average support and average confidence of the overlapping rules are then calculated to determine the strength of the similarity between the datasets. The study compares eight public datasets, and the results show that KC2 and PC2 have the highest similarity (83%), with 97% support and 100% confidence. Datasets with similar attributes and almost the same number of attributes show higher similarity than the other datasets.
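
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `strongest` and `compare` are hypothetical, the rules are assumed to have been mined already (for example with an Apriori implementation), "strongest" is assumed to mean ranking by confidence and then support, the similarity score is assumed to be the fraction of overlapping rules, and the averages are assumed to be taken over the values each shared rule has in both datasets.

```python
from typing import Dict, FrozenSet, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent)
Stats = Tuple[float, float]                   # (support, confidence)


def strongest(rules: Dict[Rule, Stats], k: int = 100) -> Dict[Rule, Stats]:
    """Keep the k strongest rules.

    Assumption: strength is ranked by confidence first, then support.
    """
    ranked = sorted(rules.items(),
                    key=lambda item: (item[1][1], item[1][0]),
                    reverse=True)
    return dict(ranked[:k])


def compare(rules_a: Dict[Rule, Stats],
            rules_b: Dict[Rule, Stats],
            k: int = 100) -> Tuple[float, float, float]:
    """Return (similarity, average support, average confidence) of the overlap."""
    top_a = strongest(rules_a, k)
    top_b = strongest(rules_b, k)
    shared = top_a.keys() & top_b.keys()  # rules present in both top-k sets
    if not shared:
        return 0.0, 0.0, 0.0

    # Assumption: similarity is the share of top rules that overlap.
    similarity = len(shared) / max(len(top_a), len(top_b))

    # Assumption: average over the statistics from both datasets.
    stats = [top_a[r] for r in shared] + [top_b[r] for r in shared]
    avg_support = sum(s for s, _ in stats) / len(stats)
    avg_confidence = sum(c for _, c in stats) / len(stats)
    return similarity, avg_support, avg_confidence
```

Calling `compare(rules_a, rules_b)` with two dictionaries that map each rule to its `(support, confidence)` pair yields the three figures the abstract reports for each pair of datasets, under the assumptions stated above.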
