Accurate and Robust Malware Analysis through Similarity of External Calls Dependency Graphs (ECDG)

Malware is a primary concern in cybersecurity, being one of the attacker’s favorite cyberweapons. Over time, malware evolves not only in complexity but also in diversity and quantity. Malware analysis automation is thus crucial. In this paper we present ECDGs, a shorter call graph representation, and a new similarity function that is accurate and robust. Toward this goal, we revisit some principles of malware analysis research to define basic primitives and an evaluation paradigm addressed for the setup of more reliable experiments. Our benchmark shows that our similarity function is very efficient in practice, achieving speedup rates of 3.30x and 354,11x wrt. radiff2 for the standard and the cache-enhanced implementations, respectively. Our evaluations generate clusters that produce almost unerring results - homogeneity score of 0.983 for the accuracy phase - and marginal information loss for a highly polluted dataset - NMI score of 0.974 between initial and final clusters of the robustness phase. Overall, ECDGs and our similarity function enable autonomous frameworks for malware search and clustering that can assist human-based analysis or improve classification models for malware analysis.

[1]  Mansour Ahmadi,et al.  Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification , 2015, CODASPY.

[2]  Jian Xu,et al.  A novel malware variants detection method based On function-call graph , 2013, IEEE Conference Anthology.

[3]  Jo Campling,et al.  Analysis of Variance (ANOVA) , 2002 .

[4]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[5]  Irfan Ul Haq,et al.  A Survey of Binary Code Similarity , 2019, ACM Comput. Surv..

[6]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[7]  Roberto Baldoni,et al.  A Survey of Symbolic Execution Techniques , 2016, ACM Comput. Surv..

[8]  Debin Gao,et al.  BinHunt: Automatically Finding Semantic Differences in Binary Programs , 2008, ICICS.

[9]  Chu-Sing Yang,et al.  An information retrieval approach for malware classification based on Windows API calls , 2013, 2013 International Conference on Machine Learning and Cybernetics.

[10]  Tzi-cker Chiueh,et al.  Automatic Generation of String Signatures for Malware Detection , 2009, RAID.

[11]  Barbara G. Ryder,et al.  Constructing the Call Graph of a Program , 1979, IEEE Transactions on Software Engineering.

[12]  Aziz Mohaisen,et al.  AMAL: High-fidelity, behavior-based automated malware analysis and classification , 2014, Comput. Secur..

[13]  Christopher Krügel,et al.  SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[14]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[15]  Halvar Flake,et al.  Structural Comparison of Executable Objects , 2004, DIMVA.

[16]  Stavros D. Nikolopoulos,et al.  A graph-based model for malware detection and classification using system-call groups , 2017, Journal of Computer Virology and Hacking Techniques.

[17]  Alexander Pretschner,et al.  Maat: Automatically Analyzing VirusTotal for Accurate Labeling and Effective Malware Detection , 2020, ACM Trans. Priv. Secur..

[18]  Mark Stamp,et al.  Deriving common malware behavior through graph clustering , 2013, Comput. Secur..

[19]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[20]  Michel José Anzanello,et al.  Chemometrics and Intelligent Laboratory Systems , 2009 .

[21]  Steven Skiena,et al.  Implementing discrete mathematics - combinatorics and graph theory with Mathematica , 1990 .

[22]  Debin Gao,et al.  iBinHunt: Binary Hunting with Inter-procedural Control Flow , 2012, ICISC.

[23]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[24]  Jiang Ming,et al.  BinSim: Trace-based Semantic Binary Diffing via System Call Sliced Segment Equivalence Checking , 2017, USENIX Security Symposium.

[25]  Daniel T. Larose,et al.  Discovering Knowledge in Data: An Introduction to Data Mining , 2005 .

[26]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[27]  Orestis Kostakis,et al.  Classy: fast clustering streams of call-graphs , 2014, Data Mining and Knowledge Discovery.

[28]  Ali Aydin Selçuk,et al.  Undecidable problems in malware analysis , 2017, 2017 12th International Conference for Internet Technology and Secured Transactions (ICITST).

[29]  Joris Kinable,et al.  Malware classification based on call graph clustering , 2010, Journal in Computer Virology.

[30]  Jianguo Jiang,et al.  Based on Multi-features and Clustering Ensemble Method for Automatic Malware Categorization , 2017, 2017 IEEE Trustcom/BigDataSE/ICESS.

[31]  Jian Xu,et al.  Detecting malware variants via function-call graph similarity , 2010, 2010 5th International Conference on Malicious and Unwanted Software.

[32]  Carsten Willems,et al.  Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur..

[33]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[34]  Svante Wold,et al.  Analysis of variance (ANOVA) , 1989 .

[35]  Christopher Krügel,et al.  Polymorphic Worm Detection Using Structural Information of Executables , 2005, RAID.

[36]  Bazara I. A. Barry,et al.  Enhancing the Detection of Metamorphic Malware using Call Graphs , 2015 .

[37]  Heejo Lee,et al.  Detecting metamorphic malwares using code graphs , 2010, SAC '10.

[38]  Olatz Arbelaitz,et al.  Evaluation of Malware clustering based on its dynamic behaviour , 2008, AusDM.

[39]  Frans Coenen,et al.  A survey of frequent subgraph mining algorithms , 2012, The Knowledge Engineering Review.

[40]  Eunjin Kim,et al.  A Novel Approach to Detect Malware Based on API Call Sequence Analysis , 2015, Int. J. Distributed Sens. Networks.

[41]  Frans Coenen,et al.  Finding Frequent Subgraphs in Longitudinal Social Network Data Using a Weighted Graph Mining Approach , 2010, ADMA.