Open-Source License Violations of Binary Software at Large Scale

Open-source licenses are widely used in open-source projects. However, developers who use or modify the source code of these projects do not always strictly follow the licenses. GPL and AGPL, two of the most popular copyleft licenses, are the most likely to be violated, because they require developers to open-source the entire project if any code under GPL/AGPL protection is included, whether modified or not. Few license violation detectors focus on binary software, owing to the challenge of mapping binary code to source code efficiently and accurately at large scale. In this paper, we propose a scalable and fully automated system to check for open-source license violations in binary software at large scale. We match source code to binary code by analyzing file attributes of executable files and code features that are not affected by compilation and that can vary between projects. Moreover, to break the barrier to large-scale analysis, we introduce an automatic extractor that obtains executable files from installation packages, which are broadly available on software download sites. In empirical experiments on binary-to-source mapping, we achieve a remarkably high accuracy of 99.5% and a recall of 95.6% without significant loss of precision. In addition, we discover 2270 binary-to-source mapping pairs, including 110 violations of the GPL and AGPL licenses, which involve 7.2% of the 1000 real-world binary software projects.
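
To make the idea of compilation-invariant code features concrete, the following is a minimal sketch, not the system described in this paper. The abstract does not name the features used; the sketch assumes string literals as one plausible example, since they typically survive compilation unchanged and differ from project to project. The function names, the `min_len` threshold, and the scoring rule are all hypothetical and chosen only for illustration.

```python
import re


def extract_printable_strings(data: bytes, min_len: int = 6) -> set[str]:
    """Return runs of printable ASCII of at least min_len bytes found in data.

    This mimics what the Unix `strings` tool does; it is an assumed
    stand-in for whatever feature extraction a real system would use.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(data)}


def extract_source_literals(source_text: str, min_len: int = 6) -> set[str]:
    """Return double-quoted string literals of at least min_len characters.

    Simplification: escape sequences such as \\n are kept as written and
    are not decoded, so literals containing them will not match a binary.
    """
    literals = re.findall(r'"((?:[^"\\]|\\.)*)"', source_text)
    return {lit for lit in literals if len(lit) >= min_len}


def match_score(binary_strings: set[str], source_literals: set[str]) -> float:
    """Fraction of source literals that occur somewhere in the binary strings.

    Substring matching is used because adjacent printable bytes in a binary
    may merge a literal into a longer run.
    """
    if not source_literals:
        return 0.0
    blob = "\n".join(binary_strings)
    hits = sum(1 for lit in source_literals if lit in blob)
    return hits / len(source_literals)


if __name__ == "__main__":
    # Hypothetical stand-ins for an executable and a source file.
    fake_binary = b"\x00\x01usage: foo [options]\x00\x7fELF\x00cannot open file %s\x00"
    fake_source = (
        'printf("usage: foo [options]"); '
        'fprintf(stderr, "cannot open file %s"); '
        'puts("never compiled in");'
    )
    score = match_score(
        extract_printable_strings(fake_binary),
        extract_source_literals(fake_source),
    )
    print(f"share of source literals found in binary: {score:.2f}")
```

A high score suggests that the binary was built from the given source. A practical system would need to weight features by how distinctive they are across projects, because strings that appear in many projects carry little evidence.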