Problems creating task-relevant clone detection reference data

One prevalent method for evaluating the results of automated software analysis tools is to compare the tools' output to the judgment of human experts. This evaluation strategy is commonly assumed in the field of software clone detector research. We report our experiences from a study in which several human judges tried to establish "reference sets" of function clones for several medium-sized software systems written in C. The study employed multiple judges and followed a process typical for inter-coder reliability assurance, wherein coders discussed classification discrepancies until consensus was reached. A high level of disagreement was found for reference sets made specifically for reengineering task contexts. The results, although preliminary, raise questions about the limitations of prior clone detector evaluations and of other similar tool evaluations. Implications are drawn for future work on reference data generation, tool evaluations, and benchmarking efforts.
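
The abstract does not include code, but the inter-coder reliability process it describes is usually accompanied by an agreement statistic such as Cohen's kappa. Below is a minimal, hypothetical sketch (not taken from the study) of computing kappa over two judges' verdicts on the same candidate clone pairs; the function and the example labels are illustrative assumptions, not the authors' data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' categorical labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed proportion of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each judge's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in counts_a.keys() | counts_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts ("clone" / "not") from two judges on ten candidate pairs.
judge_1 = ["clone", "clone", "not", "clone", "not", "not", "clone", "not", "clone", "not"]
judge_2 = ["clone", "not", "not", "clone", "not", "clone", "clone", "not", "not", "not"]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.2f}")  # 0.40 for this sample
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which gives a way to quantify the kind of judge disagreement the study reports.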
