FOSSIL: A Resilient and Efficient System for Identifying FOSS Functions in Malware Binaries

Identifying free open-source software (FOSS) packages in binaries when the source code is unavailable is important for many security applications, such as malware detection, software infringement detection, and digital forensics. This capability improves both the accuracy and the efficiency of reverse engineering tasks by avoiding false correlations between irrelevant code bases. Although the FOSS package identification problem belongs to the field of software engineering, conventional approaches rely heavily on practical methods from data mining and database searching. However, various challenges in applying these methods prevent existing function identification approaches from being effective in the absence of source code. To make matters worse, obfuscation techniques, the use of different compilers and compilation settings, and software refactoring techniques have made the automated detection of FOSS packages increasingly difficult. With very few exceptions, existing systems are not resilient to such techniques, and the exceptions are not sufficiently efficient. To address this issue, we propose FOSSIL, a novel resilient and efficient system that incorporates three components. The first component extracts the syntactical features of functions by considering opcode frequencies and applying a hidden Markov model statistical test. The second component applies a neighborhood hash graph kernel to random walks derived from control-flow graphs, with the goal of extracting the semantics of the functions. The third component applies a z-score to the normalized instructions to extract the behavior of instructions in a function. The components are integrated using a Bayesian network model, which synthesizes their results to identify the FOSS function. Combining these components through the Bayesian network yields stronger resilience to code obfuscation.
We evaluate our system on three datasets: real-world projects whose use of FOSS packages is known, malware binaries for which security and reverse engineering reports purport to describe their use of FOSS, and a large repository of malware binaries. We demonstrate that our system identifies FOSS packages in real-world projects with a mean precision of 0.95 and a mean recall of 0.85. Furthermore, FOSSIL discovers FOSS packages in malware binaries that match those listed in security and reverse engineering reports. Our results show that modern malware binaries contain FOSS packages in a proportion of 0.10 to 0.45.
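
To make two of the ingredients named above concrete, the following is a minimal Python sketch of opcode-frequency extraction and a z-score computation. It is an illustration only, not FOSSIL's implementation: the function names `opcode_frequencies` and `z_scores`, the toy instruction listings, and the choice to standardize one opcode's frequency across several functions are all assumptions made for this example. The abstract does not specify how instructions are normalized or which quantity the z-score is taken over.

```python
from collections import Counter
from statistics import mean, pstdev


def opcode_frequencies(instructions):
    """Return the relative frequency of each opcode in a function.

    The opcode is taken to be the first token of each disassembled
    instruction (an assumption made for this sketch).
    """
    opcodes = [ins.split()[0].lower() for ins in instructions if ins.strip()]
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def z_scores(values):
    """Return the z-score of each value against the population mean and std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so no value deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical disassembly listings for three functions.
functions = [
    ["push ebp", "mov ebp, esp", "mov eax, 1", "pop ebp", "ret"],
    ["push ebp", "mov ebp, esp", "xor eax, eax", "pop ebp", "ret"],
    ["mov eax, ebx", "mov ecx, edx", "mov esi, edi", "ret"],
]

# Frequency of the `mov` opcode in each function, then standardized
# across the three functions.
mov_freqs = [opcode_frequencies(f).get("mov", 0.0) for f in functions]
mov_z = z_scores(mov_freqs)
```

Here `mov_z` holds one standardized score per function, so a function whose `mov` usage is unusually high relative to the others receives a large positive value. A real system would compute such scores over many features and a much larger corpus.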
