Active refinement of clone anomaly reports

Software clones have been widely studied in the recent literature and shown to be useful for finding bugs, because inconsistent changes among clones in a clone group may indicate potential bugs. However, many inconsistent clone groups are not real bugs, and this excessive number of false positives could easily impede broad adoption of clone-based bug detection approaches. In this work, we aim to improve the usability of clone-based bug detection tools by increasing the rate at which true positives are found when a developer analyzes anomaly reports. Our idea is to limit the number of anomaly reports a user sees at a time and to actively incorporate incremental user feedback to continually refine the remaining reports. Our system first presents the top few anomaly reports from the list generated by a tool in its default ordering. The user then accepts or rejects each report. Based on this feedback, our system automatically and iteratively refines a classification model for anomalies and re-sorts the remaining reports, with the goal of presenting true positives to the user earlier than the default ordering would. The rationale is based on our observation that false positives among inconsistent clone groups tend to share common features (in terms of code structure, programming patterns, etc.), and that these features can be learned from the incremental user feedback. We evaluate our refinement process on three sets of clone-based anomaly reports extracted by a clone-based anomaly detection tool from three large real programs: the Linux kernel (C), Eclipse, and ArgoUML (both Java). The results show that, compared to the original ordering of reports, our approach improves the rate at which true positives are found (i.e., true positives are found faster) by 11%, 87%, and 86% for the Linux kernel, Eclipse, and ArgoUML, respectively.
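
The loop below is a minimal Python sketch of this feedback-driven refinement process, given only as an illustration under assumptions of our own: the classifier (a scikit-learn decision tree), the batch size, and the names active_refinement, extract_features, and ask_user are hypothetical placeholders, not the paper's actual feature set or classification model.

    # Sketch of the active refinement loop: show a few reports, collect
    # accept/reject feedback, retrain a classifier, and re-rank the rest.
    # Classifier choice, batch size, and callback names are illustrative
    # assumptions, not the implementation described in the paper.
    from sklearn.tree import DecisionTreeClassifier

    BATCH_SIZE = 5  # how many reports the user sees per iteration (assumed)

    def active_refinement(reports, extract_features, ask_user):
        """Yield anomaly reports in presentation order, re-ranking as we go.

        reports          -- anomaly reports in the tool's default ordering
        extract_features -- callback: report -> numeric feature vector
                            (code structure, programming patterns, etc.)
        ask_user         -- callback: report -> True (real bug) / False
        """
        seen_features, seen_labels = [], []
        remaining = list(reports)
        while remaining:
            # 1. Present the current top few reports and collect feedback.
            batch, remaining = remaining[:BATCH_SIZE], remaining[BATCH_SIZE:]
            for report in batch:
                seen_features.append(extract_features(report))
                seen_labels.append(1 if ask_user(report) else 0)
                yield report
            # 2. Refit the model once both classes have been observed.
            if len(set(seen_labels)) < 2 or not remaining:
                continue
            model = DecisionTreeClassifier().fit(seen_features, seen_labels)
            # 3. Re-sort the remaining reports by predicted probability of
            #    being a true positive, so likely bugs surface earlier.
            scores = model.predict_proba(
                [extract_features(r) for r in remaining])[:, 1]
            remaining = [r for _, r in sorted(zip(scores, remaining),
                                              key=lambda p: p[0],
                                              reverse=True)]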
