Polymorphic Malware Detection Using Hierarchical Hidden Markov Model

Binary signatures have been widely used to detect malicious software on the current Internet. However, this approach is unable to achieve the accurate identification of polymorphic malware variants, which can be easily generated by the malware authors using code generation engines. Code generation engines randomly produce varying code sequences but perform the same desired malicious functions. Previous research used flow graph and signature tree to identify polymorphic malware families. The key difficulty of previous research is the generation of precisely defined state machine models from polymorphic variants. This paper proposes a novel approach, using Hierarchical Hidden Markov Model (HHMM), to provide accurate inductive inference of the malware family. This model can capture the features of self-similar and hierarchical structure of polymorphic malware family signature sequences. To demonstrate the effectiveness and efficiency of this approach, we evaluate it with real malware samples. Using more than 15,000 real malware, we find our approach can achieve high true positives, low false positives, and low computational cost.