Determining whether an arbitrary file is related to known malicious files is often useful in network and host-based defense. Doing so can give network defenders enough exemplars of a particular threat to develop comprehensive signatures and heuristics for identifying it, leading to decreased response time and improved prevention of cyber attacks. Identifying malicious code families is a complex process that involves categorizing potentially malicious code into sets that share similar features while remaining distinguishable from unrelated threats and non-malicious code. Current methods for automatically or manually describing malware families are typically unable to distinguish between indicators derived from the structure of the malware and indicators derived from its behavior. Further, attempts to cluster potentially related files by mapping them into alternate domains, including histograms, fuzzy hashes, and Bloom filters, often produce clusters derived solely from structural information. These similarity measurements are often very effective on crudely similar files, yet they fail to identify files that have similar or identical behavior and semantics. We propose an analytic method, driven largely by human experience and based on objective criteria, for assigning arbitrary files membership in a malicious code family. We describe a process for iteratively refining the criteria used to select a malicious code family until those criteria are both necessary and sufficient to distinguish that family. We contrast this process with similar processes, such as antivirus signature generation and automatic and blind classification methods. We formalize this process to provide a roadmap for practitioners of malicious code analysis and to highlight opportunities for improving and automating both the process and the observation of relevant criteria. Finally, we provide experimental results of applying this methodology to real-world malware.
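
The refinement loop described above can be illustrated with a minimal sketch. It is not the method of the paper, which is driven by analyst judgment. It is a mechanical stand-in that assumes each file has already been reduced to a record of observed features and that the analyst supplies candidate criteria as predicates. The feature names (`mutex`, `packer`, `beacons`), the predicates, and the greedy selection strategy are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]            # hypothetical feature record for one file
Criterion = Callable[[Sample], bool]  # a single objective test on a sample


def matches(sample: Sample, criteria: List[Criterion]) -> bool:
    """A sample is assigned to the family only if every criterion holds."""
    return all(c(sample) for c in criteria)


def refine(criteria: List[Criterion], candidates: List[Criterion],
           members: List[Sample], non_members: List[Sample]) -> List[Criterion]:
    """Greedy refinement (an assumption of this sketch, not the paper's process).

    Necessity: drop any criterion that a known member fails.
    Sufficiency: add candidates until no known non-member satisfies them all.
    """
    current = [c for c in criteria if all(c(s) for s in members)]
    pool = [c for c in candidates
            if c not in current and all(c(s) for s in members)]
    while True:
        false_positives = [s for s in non_members if matches(s, current)]
        if not false_positives:
            return current  # necessary and sufficient on the known samples
        # Pick the candidate that excludes the most remaining false positives.
        best = max(pool,
                   key=lambda c: sum(not c(s) for s in false_positives),
                   default=None)
        if best is None or all(best(s) for s in false_positives):
            return current  # no candidate helps; new observations are needed
        current.append(best)
        pool.remove(best)


# Hypothetical criteria: one structural, two behavioral.
def uses_upx(s: Sample) -> bool:
    return s["packer"] == "upx"

def has_family_mutex(s: Sample) -> bool:
    return s["mutex"] == "fam_mtx"

def beacons_home(s: Sample) -> bool:
    return bool(s["beacons"])


members = [
    {"mutex": "fam_mtx", "packer": "upx", "beacons": True},
    {"mutex": "fam_mtx", "packer": "none", "beacons": True},
]
non_members = [
    {"mutex": "other", "packer": "upx", "beacons": True},
    {"mutex": "fam_mtx", "packer": "upx", "beacons": False},
]

refined = refine([uses_upx], [has_family_mutex, beacons_home],
                 members, non_members)
```

In this toy data the structural criterion `uses_upx` is discarded because one known member is not packed, and the two behavioral criteria are retained because together they exclude both unrelated samples. When `refine` returns while false positives remain, that corresponds to the point at which an analyst would need to observe new criteria.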