Developers often copy the code of parts of a product, or of an entire product, to start a new product or a new release. To understand the software change history and to determine code authorship, we propose constructing a universal version history from multiple version control repositories. To that end, we create two practical methods for detecting code copying at the level of the source code file: the prefix-postfix algorithm and the prefix algorithm. Both use the full pathname of a file and its version history, and both assume that developers often duplicate files by copying entire directories. Once copying is identified, we propose an algorithm that links version histories from multiple repositories, joining the change histories of files that had the same code at any point in the past, to construct the universal version history of each file. The results show that 41.32% of source files (in a repository collection involving more than 6M versions of around 2M files) were duplicated among Avaya's source code repositories for more than ten different projects. The prefix-postfix algorithm is more suitable than the prefix algorithm because it yields reasonable error rates when validated against known copying behaviors.
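The linking step described above, joining the change histories of files that had the same code at any point in the past, can be illustrated with a minimal sketch. It is not the prefix-postfix or prefix algorithm, whose details the abstract does not give. It assumes each file is identified by a hypothetical `(repository, path)` pair, that the full text of every version is available, and that "same code" means byte-identical content compared by hash. The function names `content_key` and `link_histories` are invented for this illustration.

```python
import hashlib
from collections import defaultdict


def content_key(text):
    """Hash one version's content so identical versions compare equal."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def link_histories(histories):
    """Group files whose histories share at least one identical version.

    histories: dict mapping a hypothetical (repository, path) pair to the
    list of that file's version contents (strings).
    Returns a list of sets; each set holds files linked directly or
    transitively through a shared version.
    """
    # Union-find over files: linked files end up under the same root.
    parent = {f: f for f in histories}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Remember the first file seen with each content hash and link every
    # later file carrying the same hash to it.
    first_seen = {}
    for f, versions in histories.items():
        for text in versions:
            key = content_key(text)
            if key in first_seen:
                union(first_seen[key], f)
            else:
                first_seen[key] = f

    groups = defaultdict(set)
    for f in histories:
        groups[find(f)].add(f)
    # Only groups with more than one file indicate copying.
    return [g for g in groups.values() if len(g) > 1]
```

Because the grouping is transitive, a file copied from project A to project B and later from B to C ends up in one group even if A and C never held identical content. Hashing every version is a simplification. The paper's algorithms rely on pathnames and the directory-copying assumption instead.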