Identification of Malware Families for Creating Generic Signatures: Using Dynamic Analysis and Clustering Methods*

An important step in fighting malware is the creation of generalized signatures for its detection and removal. Anti-malware research labs receive millions of new samples every day. Generating signatures for these samples requires many techniques, including static and dynamic analysis, reverse engineering, and identification of malware families. In this research, we used unsupervised learning methods to identify malware clusters (families). Once similar malware samples are clustered together, there is no need to generate a unique one-to-one signature for each of them; a single generalized signature is enough to detect most of the malware in a cluster. This approach not only speeds up malware detection but also reduces the frequency and volume of signature updates on client-side anti-malware applications. We performed dynamic analysis of 1247 malware samples from different families in a controlled environment (Cuckoo Sandbox) and used Python scripts to extract the features relevant to classification and clustering (1194 features per sample). Unsupervised machine learning methods were then applied to these features: K-means, mini-batch K-means, agglomerative clustering, spectral clustering, and density-based clustering were run on our dataset, and 10 distinct clusters were identified based on the best scores. Clustering, a heuristic approach, performed well in this work. Visualization of the resulting clusters confirmed the presence of different families. The best score was obtained using K-means and mini-batch K-means with n = 10 clusters.
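The cluster-count selection described above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline used in this work, and several things in it are assumptions. The abstract does not name the scoring metric, so the mean silhouette coefficient is used here as one common choice. The 15 two-dimensional points are a hypothetical stand-in for the 1194-dimensional behavioural feature vectors. The K-means routine is a plain Lloyd's iteration with deterministic farthest-point seeding, a simplification of what a library implementation would do. All function and variable names are illustrative.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from all centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0  # the coefficient is undefined for a single cluster
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in other) / len(other)
            for lab, other in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three synthetic "families" of five samples each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
family_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 14.0)]
samples = [(cx + dx, cy + dy) for cx, cy in family_centers for dx, dy in offsets]

# Scan candidate cluster counts and keep the one with the best score.
scores = {k: silhouette(samples, kmeans(samples, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
best_labels = kmeans(samples, best_k)
print(best_k, round(scores[best_k], 3))
```

On this toy data the scan selects three clusters, one per synthetic family. With real feature vectors the same loop would typically be run with a library implementation over a wider range of cluster counts, and the features would usually be scaled first.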