Electronic Communications of the EASST, Volume 63 (2014)
Proceedings of the Eighth International Workshop on Software Clones (IWSC 2014)

Robust Parsing of Cloned Token Sequences

Token-based clone detection techniques are known for their scalability, high recall, and robustness against syntax errors and incomplete code. However, they may report clones that are syntactically incomplete, and they know very little about the syntactic structure of the clones they report. Hence, their results cannot be used directly for automated refactoring or for syntactic relevance filters. This paper explores robust parsing techniques for parsing the code fragments reported by token-based clone detectors, in order to determine whether the clones are syntactically complete and what kinds of syntactic elements they contain. This knowledge can be used to improve the precision of token-based clone detection.
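
To make the notion of a syntactically incomplete clone concrete, the following minimal sketch classifies a cloned token sequence by its bracket structure alone. It is an illustration only, not the robust parsing technique the paper explores: bracket balance is a much weaker proxy for syntactic completeness than parsing, and the function name `bracket_status` and its return labels are assumptions introduced here.

```python
# Hypothetical helper, not taken from the paper: bracket balance is only a
# crude proxy for syntactic completeness and is far weaker than robust parsing.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def bracket_status(tokens):
    """Classify a token sequence by its bracket structure.

    Returns "balanced", "unclosed" (some opener is never closed), or
    "unmatched_closer" (a closer has no matching opener before it).
    """
    stack = []
    for tok in tokens:
        if tok in OPENERS:
            stack.append(tok)
        elif tok in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack[-1] != PAIRS[tok]:
                return "unmatched_closer"
            stack.pop()
    return "unclosed" if stack else "balanced"


# Token sequences as a token-based detector might report them.
complete = "if ( x ) { y ( ) ; }".split()
starts_mid_block = "} else { y ( ) ;".split()
ends_mid_block = "while ( x ) {".split()
```

A fragment that starts or ends in the middle of a block, like the last two above, is the kind of clone that cannot be handed directly to an automated refactoring; a filter of this sort could flag such fragments cheaply, while deciding what syntactic elements a clone contains requires actual parsing.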
