Binary Semantic Similarity Comparison Based on Software Gene

Binary code similarity is widely used in code copyright protection, vulnerability mining, and malicious code analysis. In this paper, we propose a method for measuring the similarity of two binary files based on software genes. We introduce natural language processing methods into program semantics analysis, using word2vec and doc2vec models to generate assembly instruction embeddings and gene semantic embeddings. The longest common subsequence method is then applied to evaluate software similarity. Experiments show that our method can effectively evaluate the similarity of binary files.
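
As a rough illustration of the final comparison step, the sketch below computes a longest-common-subsequence score over two sequences of gene embeddings. The abstract does not say how two genes are judged equal or how the score is normalized, so the cosine-similarity threshold (0.9), the normalization by the longer sequence, and the names `cosine`, `lcs_length`, and `similarity` are all assumptions made for this example. The embeddings are taken as given (for instance, from a doc2vec model) and are plain lists of floats here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lcs_length(xs: Sequence[Vector], ys: Sequence[Vector],
               threshold: float = 0.9) -> int:
    """Length of the longest common subsequence of two gene sequences.

    Assumption: two genes "match" when the cosine similarity of their
    embeddings is at least `threshold`.
    """
    prev = [0] * (len(ys) + 1)  # DP row for the previous gene of xs
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            if cosine(x, y) >= threshold:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(xs: Sequence[Vector], ys: Sequence[Vector],
               threshold: float = 0.9) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not xs or not ys:
        return 0.0
    return lcs_length(xs, ys, threshold) / max(len(xs), len(ys))


# Toy 2-dimensional "gene embeddings" for two hypothetical binaries.
file_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
file_b = [(1.0, 0.0), (1.0, 1.0)]
score = similarity(file_a, file_b)
```

Matching by threshold instead of exact equality lets near-identical genes count as common elements; a stricter or looser threshold would change the score, and the paper may use a different matching rule.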
