Tuning research tools for scalability and performance: The NiCad experience

Clone detection is a research technique for finding similar code within and across software systems, with applications in software understanding, maintenance, evolution, license enforcement and many other areas. The NiCad near-miss clone detection method has been shown to yield highly accurate results in both precision and recall. However, its naive two-step method has proven difficult to bring to the efficiency and scalability required for large-scale research applications: a first step parses the source to identify and normalize code fragments, and a second step compares the fragments as lines of text using longest common subsequence (LCS). Rather than presenting the NiCad tool itself in detail, this paper focuses on our experience in migrating NiCad from an initial rapid prototype to a practical, scalable research tool. The process has increased overall performance by a factor of up to 40 and clone detection speed by a factor of over 400, while reducing memory and processor requirements to fit on a standard laptop. We apply a sequence of four different kinds of performance optimizations and analyze the effect of each in detail. We believe that the lessons of our experience in migrating NiCad from research prototype to production performance may benefit others facing a similar problem.
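To make the line-based second step concrete, the following is a minimal sketch of comparing two already-normalized code fragments by LCS over their text lines. It is not NiCad's implementation: the function names, the dissimilarity measure (the fraction of lines of the longer fragment not covered by the LCS), and the `threshold` parameter are assumptions made for illustration only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists of lines.

    Standard dynamic programming, keeping only the previous row,
    so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def to_lines(fragment):
    """Split a fragment into stripped, non-empty text lines.

    Stand-in for the parsing/normalization step, which in the real
    tool is far more involved (assumption: input is pretty-printed).
    """
    return [ln.strip() for ln in fragment.splitlines() if ln.strip()]


def dissimilarity(frag1, frag2):
    """Fraction of lines in the longer fragment not matched by the LCS."""
    a, b = to_lines(frag1), to_lines(frag2)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty fragments are trivially identical
    return 1.0 - lcs_length(a, b) / longest


def is_near_miss(frag1, frag2, threshold=0.3):
    """Hypothetical clone test: dissimilarity at or below the threshold."""
    return dissimilarity(frag1, frag2) <= threshold


f1 = "int x = 0;\nx = x + 1;\nreturn x;"
f2 = "int x = 0;\nx = x + 2;\nreturn x;"
print(dissimilarity(f1, f2))            # one of three lines differs
print(is_near_miss(f1, f2, threshold=0.4))
```

Comparing every pair of fragments this way costs quadratic time per pair and quadratic pairs overall, which suggests why a naive version of this step is hard to scale.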
