Finding software license violations through binary code clone detection

Software released in binary form frequently uses third-party packages without respecting their licensing terms. For instance, many consumer devices have firmware containing the Linux kernel, without the suppliers following the requirements of the GNU General Public License. Such license violations are often accidental, e.g., when vendors receive binary code from their suppliers with no indication of its provenance. To help find such violations, we have developed the Binary Analysis Tool (BAT), a system for code clone detection in binaries. Given a binary, such as a firmware image, it attempts to detect cloning of code from repositories of packages in source and binary form. We evaluate and compare the effectiveness of three of BAT's clone detection techniques: scanning for string literals, detecting similarity through data compression, and detecting similarity by computing binary deltas.
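To make two of the three techniques concrete, the following is a minimal sketch and not BAT's actual implementation. It assumes string literals can be approximated by runs of printable ASCII (as the Unix `strings` utility does) and that compression-based similarity is measured with a normalized compression distance computed with zlib. The function names `extract_strings`, `string_overlap`, and `ncd`, the minimum string length of 5, and the choice of zlib are illustrative assumptions. The binary-delta technique is omitted here.

```python
import re
import zlib


def extract_strings(data: bytes, min_len: int = 5) -> set[bytes]:
    """Return runs of printable ASCII of at least min_len bytes.

    This approximates string-literal extraction; min_len=5 is an
    arbitrary illustrative threshold, not a value from the paper.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return set(re.findall(pattern, data))


def string_overlap(binary: bytes, package_strings: set[bytes]) -> float:
    """Fraction of a package's known strings that also occur in the binary."""
    if not package_strings:
        return 0.0
    found = extract_strings(binary)
    return len(found & package_strings) / len(package_strings)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little. zlib is an assumed
    compressor; its 32 KiB window limits usefulness on large inputs.
    """
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

In such a sketch, a high `string_overlap` score or a low `ncd` value against a known package would only be a hint that the package's code may be present in the binary; any real threshold would have to be calibrated empirically.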
