HAC-T and Fast Search for Similarity in Security

Similarity digests have gained popularity in many security applications, such as blacklisting/whitelisting and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware and is more resistant to evasion than other similarity digests such as ssdeep and sdhash. Searching and clustering are fundamental tools that help security analysts and security operations center (SOC) operators hunt and analyze malware. Current approaches to clustering malware do not scale well enough to keep up with the vast amount of malware and goodware in the wild. In this paper, we present techniques for fast search and clustering of TLSH hash digests, which can help analysts inspect large amounts of malware and goodware. Our approach builds on fast nearest neighbor search techniques to construct a tree-based index that performs fast search over TLSH hash digests. The tree-based index is used in our threshold-based Hierarchical Agglomerative Clustering (HAC-T) algorithm, which clusters digests in a scalable manner, in O(n log n) time on average. We performed an empirical evaluation comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable while still producing good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. We obtained high purity scores, ranging from 0.97 to 0.98, using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee), which demonstrates the effectiveness of the proposed method.
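To make the idea of a tree-based index feeding a threshold-based clustering step more concrete, the following is a minimal sketch and not the HAC-T algorithm itself. It makes several assumptions: digests are modeled as small integers, Hamming distance stands in for the TLSH distance, a vantage-point tree stands in for the paper's index, and the clustering step is plain single-linkage merging of all pairs within a threshold. The names `hamming`, `VPTree`, `threshold_clusters`, and `purity` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def hamming(a: int, b: int) -> int:
    """Stand-in distance (an assumption): number of differing bits."""
    return bin(a ^ b).count("1")


class VPTree:
    """Vantage-point tree over item indices; assumes `dist` is a metric."""

    def __init__(self, items, dist, seed=0):
        self.items, self.dist = items, dist
        self.rng = random.Random(seed)
        self.root = self._build(list(range(len(items))))

    def _build(self, idx):
        if not idx:
            return None
        vp = idx.pop(self.rng.randrange(len(idx)))  # pick a vantage point
        if not idx:
            return (vp, 0, None, None)
        ds = [self.dist(self.items[vp], self.items[i]) for i in idx]
        mu = sorted(ds)[len(ds) // 2]  # median distance splits the rest
        inner = [i for i, d in zip(idx, ds) if d <= mu]
        outer = [i for i, d in zip(idx, ds) if d > mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def within(self, query, radius):
        """Return indices of all items whose distance to `query` is <= radius."""
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(query, self.items[vp])
            if d <= radius:
                out.append(vp)
            if d - radius <= mu:  # query ball can overlap the inner region
                stack.append(inner)
            if d + radius > mu:  # query ball can overlap the outer region
                stack.append(outer)
        return out


def threshold_clusters(items, dist, threshold):
    """Single-linkage clustering: merge every pair within `threshold`."""
    tree = VPTree(items, dist)
    parent = list(range(len(items)))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, item in enumerate(items):
        for j in tree.within(item, threshold):
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def purity(cluster_ids, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    counts = {}
    for c, y in zip(cluster_ids, labels):
        counts.setdefault(c, Counter())[y] += 1
    return sum(max(c.values()) for c in counts.values()) / len(labels)


# Toy data: two made-up "families" of 16-bit digests.
digests = [0x0000, 0x0001, 0x0003, 0xFF00, 0xFF01, 0xFD00]
labels = ["A", "A", "A", "B", "B", "B"]
ids = threshold_clusters(digests, hamming, threshold=2)
print(len(set(ids)), purity(ids, labels))
```

The tree lets each range query skip subtrees that the triangle inequality rules out, which is where the savings over an all-pairs comparison would come from. The pruning in `within` is only valid when the distance satisfies the triangle inequality, so a real digest distance would need to be checked against that assumption before being used this way.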
