Scalable Malware Clustering using Multi-Stage Tree Parallelization

Similarity hashing is an important tool for finding and analyzing malware samples that resemble known malware. Several similarity hashing schemes exist in the literature, including ssdeep, TLSH, and sdhash. TLSH has been found to be particularly well suited to finding malware (and goodware) samples related to known ones; in particular, it has been shown to be good at finding the different variants of a given piece of malware. Previous work has shown that TLSH hashes can be used to build fast search and clustering techniques that scale to tens of millions of items. In this paper, we show that this approach can be made to scale to even larger data sizes by clustering in stages. A fast clustering algorithm (such as k-Means) is applied in multiple stages to obtain coarse clusters, which are then processed in parallel by other state-of-the-art clustering techniques to obtain the final clusters. We show that such a multi-stage technique can cluster up to 10 million items with a 9-12x speedup over using existing state-of-the-art clustering techniques alone, and that the resulting cluster quality is comparable to that obtained by existing methods. Moreover, we show how to optimize the cost (dollars spent on the cloud) or the latency of multi-stage clustering by choosing appropriate parameter values.
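The coarse-then-fine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: items are toy 2-D points compared by Euclidean distance rather than TLSH digests, there is a single coarse k-Means stage, and a simple threshold-based single-linkage step stands in for the state-of-the-art clustering technique used in the final stage. The names `kmeans`, `fine_cluster`, and `multi_stage_cluster` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance (assumption: stands in for a TLSH distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Coarse stage: plain Lloyd's k-means; returns the non-empty groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def fine_cluster(points, threshold):
    """Fine stage stand-in: single linkage joining points within `threshold`."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sq_dist(points[i], points[j]) <= threshold ** 2:
                parent[find(i)] = find(j)
    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return list(clusters.values())


def multi_stage_cluster(points, k, threshold, workers=4):
    """Partition coarsely, then refine each coarse group independently."""
    coarse = kmeans(points, k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        refined = pool.map(lambda g: fine_cluster(g, threshold), coarse)
    return [cluster for group in refined for cluster in group]


if __name__ == "__main__":
    blobs = [(0, 0), (10, 0), (0, 10), (10, 10)]
    offsets = [(0, 0), (0.1, 0), (0, 0.1)]
    pts = [(bx + dx, by + dy) for bx, by in blobs for dx, dy in offsets]
    for cluster in multi_stage_cluster(pts, k=2, threshold=1.0):
        print(cluster)
```

Because the quadratic fine stage only ever sees one coarse group at a time, its cost drops roughly with the square of the group size, and the groups can be processed independently. The thread pool here only shows the structure; for CPU-bound work in CPython, a process pool or separate machines would be the more likely choice. One trade-off of this design is that items split across coarse groups are never merged by the fine stage.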
