On Challenges in Evaluating Malware Clustering

Malware clustering and classification are important tools that enable analysts to prioritize their malware analysis efforts. The recent emergence of fully automated methods for malware clustering and classification that report high accuracy suggests that this problem may largely be solved. In this paper, we report the results of our attempt to confirm our conjecture that the method of selecting ground-truth data in prior evaluations biases their results toward high accuracy. To examine this conjecture, we apply clustering algorithms from a different domain (plagiarism detection), first to the dataset used in a prior work's evaluation and then to a wholly new malware dataset, to see if clustering algorithms developed without attention to subtleties of malware obfuscation are nevertheless successful. While these studies provide conflicting signals as to the correctness of our conjecture, our investigation of possible reasons uncovers, we believe, a cautionary note regarding the significance of highly accurate clustering results, as can be impacted by testing on a dataset with a biased cluster-size distribution.

[1]  Michael J. Wise Detection of similarities in student programs: YAP'ing may be preferable to plague'ing , 1992, SIGCSE '92.

[2]  Zhenkai Liang,et al.  BitBlaze: A New Approach to Computer Security via Binary Analysis , 2008, ICISS.

[3]  Akito Monden,et al.  Dynamic Software Birthmarks to Detect the Theft of Windows Applications , 2004 .

[4]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[5]  Vipin Kumar,et al.  Introduction to Data Mining, (First Edition) , 2005 .

[6]  Carsten Willems,et al.  Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur..

[7]  Sencun Zhu,et al.  Detecting Software Theft via System Call Based Birthmarks , 2009, 2009 Annual Computer Security Applications Conference.

[8]  Pavel Laskov,et al.  Detection of Intrusions and Malware, and Vulnerability Assessment: 19th International Conference, DIMVA 2022, Cagliari, Italy, June 29 –July 1, 2022, Proceedings , 2022, International Conference on Detection of intrusions and malware, and vulnerability assessment.

[9]  G. Whale Indentification of Program Similarity in Large Populations , 1990, Comput. J..

[10]  Nick Feamster,et al.  Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces , 2010, NSDI.

[11]  U. Bayer,et al.  TTAnalyze: A Tool for Analyzing Malware , 2006 .

[12]  Marius Gheorghescu AN AUTOMATED VIRUS CLASSIFICATION SYSTEM , 2006 .

[13]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[14]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[15]  Felix C. Freiling,et al.  Toward Automated Dynamic Malware Analysis Using CWSandbox , 2007, IEEE Secur. Priv..

[16]  Kang G. Shin,et al.  Large-scale malware indexing using function-call graphs , 2009, CCS.

[17]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.