Predicting Bugs in Software Code Changes Using Isolation Forest

Identifying a bug as soon as it is introduced can improve the validity and effectiveness of bug fixing. Predicting bugs in software code changes makes such identification possible. Buggy changes, that is, changes that introduce bugs into source code, can be viewed as anomalies relative to clean changes because they are rare and irregular. Anomaly detection techniques can therefore be applied to buggy change prediction. Isolation Forest detects anomalies based on the hypothesis that anomalies have shorter average path lengths than normal instances in a constructed forest of random trees, and it has performed well compared to other anomaly detection methods. In this paper, we adopt it to predict bugs in software code changes. An empirical study on eight real-world open source projects is conducted to validate the effectiveness of Isolation Forest for bug prediction in software code changes. The results show that, compared to the traditional classification methods used in the literature, Isolation Forest achieves better clean precision, buggy recall, buggy F-measure, AUC and Gmean.
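
The path-length hypothesis can be illustrated with a minimal pure-Python sketch. It is not the implementation evaluated in the paper: the two change features (lines added, files touched), the synthetic data, and all function names are hypothetical, and the settings of 100 trees and a subsample of 256 are commonly used defaults assumed here rather than values taken from the study. Each tree splits the data on a random feature at a random value, so a change that lies far from the others tends to be separated after very few splits and receives a score close to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # this feature cannot separate the remaining points
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def build_forest(points, n_trees=100, sample_size=256, seed=0):
    """Build n_trees trees, each on a random subsample; returns (trees, subsample size)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    n = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(rng.sample(points, n), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, n


def path_length(point, tree):
    """Number of splits needed to reach a leaf, plus a correction for unsplit leaves."""
    depth = 0
    while "size" not in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["split"] else tree["right"]
        depth += 1
    return depth + average_path_length(tree["size"])


def anomaly_score(point, forest):
    """Score in (0, 1); a shorter average path gives a score closer to 1."""
    trees, n = forest
    mean_path = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / average_path_length(n))


# Hypothetical data: each change is (lines_added, files_touched).
data_rng = random.Random(42)
clean_changes = [(data_rng.gauss(10, 3), data_rng.gauss(2, 1)) for _ in range(200)]
suspect_change = (500.0, 40.0)

forest = build_forest(clean_changes + [suspect_change])
print("suspect:", anomaly_score(suspect_change, forest))
print("typical:", anomaly_score((10.0, 2.0), forest))
```

In this toy setting the suspect change should score noticeably higher than a typical one. Turning scores into buggy or clean predictions still requires a threshold, which the sketch leaves open.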
