An important step in fighting malware is the creation of generalized signatures for its detection and removal. Anti-malware research labs receive millions of new samples every day. Generating signatures for these samples requires many techniques, e.g., static and dynamic analysis, reverse engineering, and identification of malware families. In this research, we used unsupervised learning methods to identify malware clusters (families). Once similar malware is clustered together, there is no need to generate a unique one-to-one signature for each sample; a single generalized signature is enough to detect most of the malware in a cluster. This approach not only speeds up malware detection but also reduces the frequency and volume of signature updates on client-side anti-malware applications. We performed a dynamic analysis of 1247 malware samples from different families in a controlled environment (Cuckoo Sandbox) and used Python scripts to extract the features relevant to classification and clustering (1194 features per sample). Unsupervised machine learning methods were then trained on these features. K-means, mini-batch K-means, agglomerative clustering, spectral clustering, and density-based clustering were applied to our dataset, and 10 distinct clusters were identified based on the best scores. Clustering, a heuristic approach, performed well in this work. Visualization of the resulting clusters confirmed the presence of different families. The best score was obtained using K-means and mini-batch K-means with n = 10 clusters.
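The idea of choosing the number of clusters by a clustering score can be illustrated with a minimal sketch. It is not the pipeline used in this work. The abstract does not name the scoring metric, so the silhouette coefficient is assumed here. The two-dimensional toy points stand in for the 1194-dimensional feature vectors, and the helper names (`farthest_point_init`, `kmeans`, `silhouette`) are hypothetical. The sketch uses only the Python standard library so that it runs as-is.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy stand-in for extracted features: three tight, well-separated groups.
offsets = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5)]
blob_centers = [(0, 0), (10, 0), (0, 10)]
samples = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

# Score each candidate cluster count and keep the best one.
scores = {k: silhouette(samples, kmeans(samples, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On this toy data the score peaks at k = 3, matching the three groups the points were built from. For real feature matrices, a library such as scikit-learn offers `KMeans`, `MiniBatchKMeans`, and `silhouette_score`, which would replace the hand-written helpers above.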