Malware Similarity Identification Using Call Graph Based System Call Subsequence Features

Recent literature has proposed approaches for detecting code-sharing relationships between malware artifacts, which help accelerate the malware reverse engineering process. In this paper we propose a novel code-sharing analysis technique that complements existing methods. Our algorithm partitions malware system call logs into subsequences by identifying points in the logs where the set of saved instruction pointers on the program call stack changes significantly. Each extracted subsequence thus reflects the system calls that occur within a local region of the program call graph. We then use these subsequences as features for computing a malware sample similarity matrix. A unique contribution of our method is that it incorporates sequence information into the features it uses for similarity analysis but, unlike previously proposed longest common substring methods, runs in linear time. Similarly, our method incorporates call stack information into its features but is computationally far more tractable than previously proposed call graph isomorphism techniques. Because we extract information from sample behavior logs, we avoid the problem of obfuscated samples that resist static analysis tools. We evaluated our method on a corpus of 959 samples and achieved high precision with respect to known malware family labels.
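
To make the partitioning idea concrete, the following Python sketch illustrates one way it could be realized. It is an illustration under stated assumptions, not the implementation evaluated in this paper. The log format (a system call name paired with a set of saved instruction pointers), the use of Jaccard similarity to decide when the call stack has "changed significantly", the `threshold` value of 0.5, and the function names `jaccard`, `partition`, and `similarity_matrix` are all hypothetical choices made for the example.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumed log format: (system call name, saved instruction pointers on the stack).
Event = Tuple[str, FrozenSet[int]]


def jaccard(a, b) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def partition(log: Sequence[Event], threshold: float = 0.5) -> List[Tuple[str, ...]]:
    """Split a log wherever consecutive call stacks overlap less than `threshold`.

    The log is scanned once, so the number of stack comparisons is linear
    in the number of logged system calls.
    """
    subsequences: List[Tuple[str, ...]] = []
    current: List[str] = []
    prev_stack = None
    for name, stack in log:
        # Assumption: a low overlap with the previous stack marks a boundary.
        if prev_stack is not None and jaccard(prev_stack, stack) < threshold:
            subsequences.append(tuple(current))
            current = []
        current.append(name)
        prev_stack = stack
    if current:
        subsequences.append(tuple(current))
    return subsequences


def similarity_matrix(logs: Sequence[Sequence[Event]], threshold: float = 0.5):
    """Pairwise Jaccard similarity over each sample's set of subsequences."""
    features = [set(partition(log, threshold)) for log in logs]
    return [[jaccard(a, b) for b in features] for a in features]


# Toy logs with made-up stack addresses, for illustration only.
log_a = [
    ("NtOpenFile", frozenset({1, 2, 3})),
    ("NtReadFile", frozenset({1, 2, 3, 4})),
    ("NtDeviceIoControlFile", frozenset({1, 9, 10})),
    ("NtWriteFile", frozenset({1, 9, 10, 11})),
]
log_b = [
    ("NtOpenFile", frozenset({5, 6})),
    ("NtReadFile", frozenset({5, 6, 7})),
    ("NtCreateKey", frozenset({5, 20, 21})),
]

print(partition(log_a))
print(similarity_matrix([log_a, log_b]))
```

In this toy example both samples yield the subsequence `("NtOpenFile", "NtReadFile")` even though their stack addresses differ, so they share one feature. Note that only the partitioning pass is a single scan of each log; the pairwise matrix in this sketch still compares every pair of samples.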