BCFinder: A Lightweight and Platform-Independent Tool to Find Third-Party Components in Binaries

The open source movement has fostered many open source communities, which now host millions of open source repositories (repos). As a result, component-based development and code reuse greatly improve the efficiency of software development. However, they can also introduce problems such as license violations and security weaknesses. While code reuse detection has been studied extensively for source code, the detection of third-party components in binaries, especially against a large-scale database such as GitHub, has received less attention. In this paper, we apply a series of data cleaning steps to obtain a filtered set of 22K C/C++ repos from GitHub. We extend code reuse detection for binaries to this large-scale data set and design a system called BCFinder as an assistant tool for binary analysis. BCFinder finds third-party components in binaries automatically by feature matching. We evaluate BCFinder on a number of real-world binary programs across platforms and compilation configurations. Experiments show that BCFinder is an effective supplementary tool for binary analysis. BCFinder is, so far, the first lightweight, rapid, and platform-independent tool to detect component reuse in binaries against a large-scale database such as GitHub.
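The abstract does not say which features BCFinder matches or how it scores them, so the following is only a minimal sketch of the general idea of feature matching. It assumes, purely for illustration, that the features are printable string literals that survive compilation; the function names, the `min_len` cutoff, and the `threshold` value are all hypothetical and not taken from the paper.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def extract_strings(blob: bytes, min_len: int = 6) -> Set[str]:
    """Return printable ASCII runs of at least min_len bytes (assumed feature type)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.decode("ascii") for m in pattern.findall(blob)}


def build_index(repo_features: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert repo -> features into feature -> repos for fast lookup."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for repo, features in repo_features.items():
        for feature in features:
            index[feature].add(repo)
    return index


def match_components(blob: bytes,
                     repo_features: Dict[str, Set[str]],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Score each repo by the fraction of its features found in the binary.

    The threshold is a hypothetical cutoff chosen for this sketch.
    """
    index = build_index(repo_features)
    hits: Dict[str, int] = defaultdict(int)
    for feature in extract_strings(blob):
        for repo in index.get(feature, ()):
            hits[repo] += 1
    scores = {repo: hits[repo] / len(repo_features[repo]) for repo in hits}
    return {repo: score for repo, score in scores.items() if score >= threshold}
```

Because a sketch like this reads only raw bytes, it needs no disassembly and does not depend on the target architecture, which is one plausible way to reach the lightweight and platform-independent goals the abstract describes. A real system would also have to deal with features shared by many repos, which this sketch ignores.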
