Defining malware families based on analyst insights

Determining whether arbitrary files are related to known malicious files is often useful in network- and host-based defense. Doing so can give network defenders sufficient exemplars of a particular threat to develop comprehensive signatures and heuristics for identifying it, leading to decreased response time and improved prevention of cyber attacks. Identifying these malware families is a complex process that involves categorizing potentially malicious code into sets that share similar features while remaining distinguishable from unrelated threats and from non-malicious code. Current methods for describing malware families, whether automatic or manual, are typically unable to distinguish between indicators derived from the structure of the malware and indicators derived from its behavior. Further, attempts to cluster potentially related files by mapping them into alternate domains, such as histograms, fuzzy hashes, and Bloom filters, often produce clusters derived solely from structural information. These similarity measurements are often very effective on structurally similar files, yet they fail to identify files that have similar or identical behavior and semantics. We propose an analytic method, driven largely by human experience and based on objective criteria, for assigning arbitrary files membership in a malicious code family. We describe a process for iteratively refining the criteria used to select a malicious code family until those criteria are both necessary and sufficient to distinguish that family. We contrast this process with similar processes, such as antivirus signature generation and automatic and blind classification methods. We formalize this process to provide a roadmap for practitioners of malicious code analysis and to highlight opportunities for improving and automating both the process and the observation of relevant criteria. Finally, we provide experimental results of applying this methodology to real-world malware.
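
As a rough illustration of what "necessary and sufficient" criteria could mean in practice, the following minimal Python sketch treats each criterion as a predicate over a sample's observed features. It is not the method formalized in this work. The feature names (`mutex`, `packer`, `port`), the sample data, and the `refine` helper are all hypothetical and chosen only for illustration. A criterion is kept only if every known family member satisfies it (necessity), and any known non-member that still satisfies all kept criteria is reported as a sufficiency failure for the analyst to resolve in the next iteration.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]            # hypothetical: observed features of one file
Criterion = Callable[[Sample], bool]  # a predicate an analyst proposes


def matches(sample: Sample, criteria: Dict[str, Criterion]) -> bool:
    """True if the sample satisfies every criterion in the set."""
    return all(check(sample) for check in criteria.values())


def refine(
    members: List[Sample],
    non_members: List[Sample],
    candidates: Dict[str, Criterion],
) -> Tuple[Dict[str, Criterion], List[Sample]]:
    """One refinement pass over analyst-proposed candidate criteria."""
    # Necessity: drop any criterion that some known member fails.
    kept = {
        name: check
        for name, check in candidates.items()
        if all(check(s) for s in members)
    }
    # Sufficiency: non-members that still match are false positives,
    # which signal that the analyst needs to add a further criterion.
    false_positives = [s for s in non_members if matches(s, kept)]
    return kept, false_positives


# Hypothetical samples; all values are invented for illustration.
members = [
    {"mutex": "example_mutex", "packer": "UPX", "port": 443},
    {"mutex": "example_mutex", "packer": None, "port": 443},
]
non_members = [
    {"mutex": None, "packer": "UPX", "port": 443},
    {"mutex": "other_mutex", "packer": None, "port": 80},
]

# First pass: a single behavioral criterion is necessary but not sufficient.
first_kept, first_fps = refine(
    members, non_members, {"beacons_443": lambda s: s["port"] == 443}
)

# Second pass: the analyst adds criteria; the structural one (packer) is
# dropped because one member is unpacked, and the remaining set is sufficient.
candidates = {
    "beacons_443": lambda s: s["port"] == 443,
    "uses_upx": lambda s: s["packer"] == "UPX",
    "creates_mutex": lambda s: s["mutex"] == "example_mutex",
}
final_kept, final_fps = refine(members, non_members, candidates)
```

In this toy data the packer criterion, a purely structural indicator, fails necessity, while the two behavioral criteria together separate the family from the non-members. Real criteria would of course come from analyst observation rather than a fixed dictionary.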