Quantifying Malware Evolution through Archaeology

Dynamic analysis of malware allows us to examine malware samples, and then group those samples into families based on observed behavior. Using Boolean variables to represent the presence or absence of a range of malware behavior, we create a bitstring that represents each malware behaviorally, and then group samples into the same class if they exhibit the same behavior. Combining class definitions with malware discovery dates, we can construct a timeline of showing the emergence date of each class, in order to examine prevalence, complexity, and longevity of each class. We find that certain behavior classes are more prevalent than others, following a frequency power law. Some classes have had lower longevity, indicating that their attack profile is no longer manifested by new variants of malware, while others of greater longevity, continue to affect new computer systems. We verify for the first time commonly held intuitions on malware evolution, showing quantitatively from the archaeological record that over 80% of the time, classes of higher malware complexity emerged later than classes of lower complexity. In addition to providing historical perspective on malware evolution, the methods described in this paper may aid malware detection through classification, leading to new proactive methods to identify malicious software.