Robust Network-Based Binary-to-Vector Encoding for Scalable IoT Binary File Retrieval

The goal of IoT binary file retrieval is to retrieve homologous binary files from a large IoT binary file database. Binary file retrieval has many applications, such as security analysis, OEM detection, and plagiarism detection. However, traditional string-based approaches struggle to retrieve binary files that contain few or obfuscated strings. To solve this problem, we propose a novel neural network-based approach that encodes a binary file into a numerical vector based on non-string binary features. Moreover, because files are encoded as vectors, the retrieval task can be accelerated with locality-sensitive hashing. For network training and testing, we compile 893 open-source components into 71,129 labeled binary file pairs using 16 different compilation configurations. We implement a prototype called B2V and compare it with IHB, a string-based approach, on both the original and the string-obfuscated testing sets. The results show that B2V achieves a higher AUC than IHB (0.94 vs. 0.81) on the string-obfuscated testing set while maintaining performance comparable to IHB on the original testing set. Moreover, B2V can be easily retrained to adapt to string-obfuscated scenarios, with a 15%–20% performance improvement. In the interest of open science, we also make our dataset publicly available to seed future improvements.
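To illustrate how vector encodings make hashing-based retrieval possible, the following is a minimal sketch of random-hyperplane locality-sensitive hashing for cosine similarity. It is not the B2V implementation: the abstract does not state which LSH scheme is used, so the hash family, the names `LSHIndex`, `signature`, and `cosine`, and the parameters `n_bits` and `n_tables` are all assumptions made for illustration, and the vectors are placeholders rather than real encodings.

```python
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


class LSHIndex:
    """Hypothetical index: several hash tables of random hyperplanes."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        self.vectors = {}
        self.tables = []
        for t in range(n_tables):
            rng = random.Random(seed + t)
            planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for _ in range(n_bits)]
            self.tables.append((planes, defaultdict(list)))

    def add(self, key, vec):
        """Store vec under key in the bucket it hashes to in every table."""
        self.vectors[key] = vec
        for planes, buckets in self.tables:
            buckets[signature(vec, planes)].append(key)

    def query(self, vec, top_k=5):
        """Collect keys sharing a bucket with vec, then rank by cosine."""
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), []))
        ranked = sorted(candidates,
                        key=lambda k: cosine(vec, self.vectors[k]),
                        reverse=True)
        return ranked[:top_k]


# Placeholder 4-dimensional vectors standing in for encoded binary files.
index = LSHIndex(dim=4, n_bits=6, n_tables=3, seed=1)
index.add("file_a", [1.0, 0.0, 0.5, -0.2])
index.add("file_b", [-1.0, 0.3, -0.5, 0.2])
print(index.query([2.0, 0.0, 1.0, -0.4]))
```

Only vectors that share a bucket with the query in at least one table are compared exactly, which is what lets the search avoid scanning the whole database. More bits per table shrink the buckets (fewer candidates, more misses), while more tables recover some of the missed neighbors.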
