Polymorphic malware detection and identification via context-free grammar homomorphism

Computer viruses continue to proliferate despite the use of virus detection systems (VDS). This is because these systems cannot detect variants that are not represented in their signature databases. Current detection systems search for contiguous byte sequences, use regular expressions to match noncontiguous sequences, or observe a program's initial behavior within a sandbox. Recent research has focused on using control-flow graph isomorphism in detection. These techniques are ineffective against some polymorphs, which change their byte sequences and initial behavior and produce nonisomorphic control-flow graphs. Our approach instead compares the hierarchical structure of programs. It rests on three observations: polymorphic instances are variants of the same program, these variants use the same algorithm, and a program's algorithm determines its hierarchical structure. Our technique maps a program's hierarchical structure to a context-free grammar, normalizes the grammar, and uses a fast check for homomorphism between the normalized grammars. © 2007 Alcatel-Lucent.
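
The three steps named above (map structure to a grammar, normalize, compare) can be illustrated with a toy sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes, purely for illustration, that each procedure is a nonterminal whose productions list the instructions (terminals) and calls (nonterminals) in its body, that normalization erases terminals and renames nonterminals in breadth-first order, and that the comparison only tests for a bijective renaming, which is a special case of a homomorphism. The names `normalize` and `same_structure` and the sample grammars are hypothetical.

```python
from collections import deque

def normalize(grammar, start):
    """Assumed normalization: drop terminals, rename nonterminals N0, N1, ...
    in breadth-first order of first appearance, starting from `start`."""
    names = {start: "N0"}
    queue = deque([start])
    normalized = {}
    while queue:
        nonterminal = queue.popleft()
        rules = []
        for rhs in grammar[nonterminal]:
            body = []
            for symbol in rhs:
                if symbol not in grammar:
                    continue  # terminal (an instruction): erased
                if symbol not in names:
                    names[symbol] = f"N{len(names)}"
                    queue.append(symbol)
                body.append(names[symbol])
            rules.append(tuple(body))
        normalized[names[nonterminal]] = rules
    return normalized

def same_structure(g1, start1, g2, start2):
    """True when the normalized grammars are identical, i.e. the two
    grammars differ only in terminals and nonterminal names."""
    return normalize(g1, start1) == normalize(g2, start2)

# Two hypothetical variants: different instructions and procedure names,
# same call hierarchy.
variant_a = {
    "main": [("push", "decrypt", "payload")],
    "decrypt": [("xor", "loop")],
    "loop": [("dec",)],
    "payload": [("call",)],
}
variant_b = {
    "f0": [("nop", "f7", "mov", "f3")],
    "f7": [("add", "sub", "f9")],
    "f9": [("nop",)],
    "f3": [("jmp",)],
}
# An unrelated program with a different hierarchy.
other = {
    "s": [("a", "t", "t")],
    "t": [("b",)],
}

print(same_structure(variant_a, "main", variant_b, "f0"))
print(same_structure(variant_a, "main", other, "s"))
```

Because the sketch discards every terminal and depends on the order in which calls appear, it is far cruder than a real grammar homomorphism check. It only shows why byte-level and instruction-level changes need not affect a comparison made on hierarchical structure.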