Software-Clone Rates in Open-Source Programs Written in C or C++

It is often claimed that duplicated code, also known as software clones, occurs frequently. Different researchers have reported clone rates between 19% and 28%, and in extreme cases even 59% for particular systems. It is not clear, however, whether those systems are just outliers. In this paper, we analyze about 7,800 open-source projects written in C or C++, summing up to 240 MSLOC, and measure their clone rates. We use statistical analysis to estimate the means of clone rates in open-source projects. Based on our findings, we could not confirm the high clone rates reported in previous studies as expected averages. Except for small projects that include a few copied and modified files, we found rather low clone rates compared to previous studies. For instance, if a minimal clone length of 100 tokens (roughly 16 LOC) is required, we found an average type-2 clone rate of about 12%. For type-1 clones of this length, we found an average clone rate of only 1%. However, our results also show that cloning is common. Only 20% of the projects have no type-2 clone of at least 100 tokens, and 44% of the projects have at least one type-1 clone of at least 100 tokens.
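To make the terms concrete, the following is a minimal sketch of how a token-based clone rate could be computed. It is not the detector used in the study. It assumes that a type-1 clone is an exact repeat of a token sequence, that a type-2 clone is a repeat after identifiers and numeric literals are replaced by placeholders, and that the clone rate is the fraction of tokens covered by at least one clone of the minimal length. The tokenizer, the keyword set, and the names `tokenize`, `normalize`, and `clone_rate` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Deliberately crude tokenizer (assumption): identifiers, integers, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Tiny illustrative keyword set, not the full C or C++ keyword list.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def normalize(token):
    """Type-2 view: map identifiers and integer literals to placeholders."""
    if token in KEYWORDS:
        return token
    if re.fullmatch(r"[A-Za-z_]\w*", token):
        return "ID"
    if token.isdigit():
        return "NUM"
    return token


def clone_rate(tokens, min_len):
    """Fraction of tokens covered by a min_len-token window that occurs more than once.

    Simplification: overlapping occurrences of the same window also count as clones.
    """
    if not tokens:
        return 0.0
    starts_by_window = defaultdict(list)
    for i in range(len(tokens) - min_len + 1):
        starts_by_window[tuple(tokens[i:i + min_len])].append(i)
    covered = set()
    for starts in starts_by_window.values():
        if len(starts) > 1:
            for start in starts:
                covered.update(range(start, start + min_len))
    return len(covered) / len(tokens)


code = "int f(int a) { return a + 1; } int g(int b) { return b + 1; }"
tokens = tokenize(code)
type1_rate = clone_rate(tokens, min_len=5)
type2_rate = clone_rate([normalize(t) for t in tokens], min_len=5)
print(type1_rate, type2_rate)
```

The two functions in the sample differ only in their identifiers, so they share no exact five-token window but become identical after normalization. This illustrates why a type-2 rate can be far higher than the type-1 rate for the same minimal clone length.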
