Hierarchical management of large-scale malware data

As new malware is generated at an accelerating pace, clustering and classifying newly discovered samples requires new approaches to data management. We describe a Big Data approach to managing malware that supports effective and efficient analysis of large, rapidly evolving malware collections. The key element of our approach is a hierarchical organization that groups malware into families, maintains a rich description of the relationships between samples, and facilitates efficient online analysis of new malware as it is discovered. Using clustering evaluation metrics, we show that our system discovers malware families comparable to those produced by traditional hierarchical clustering algorithms, while scaling much better with the size of the data set. We also show that the system is flexible: data representations, methods of comparing malware binaries, clustering algorithms, and other components can be substituted. Our approach will enable malware analysts and investigators to quickly understand and quantify changes in the global malware ecosystem.
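To make the idea of online insertion into a family hierarchy concrete, the following is a minimal sketch and not the system described above. Everything in it is an assumption made for illustration: the byte 4-gram representation, Jaccard similarity as the comparison method, the 0.5 threshold, and the hypothetical names `Node`, `ngrams`, `jaccard`, and `insert`. Each new sample descends the tree toward the most similar child and either joins an existing family, forms a new family with a similar sample, or starts a new branch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def ngrams(data: bytes, n: int = 4) -> frozenset:
    """Set of byte n-grams of a binary (assumed representation)."""
    return frozenset(data[i:i + n] for i in range(len(data) - n + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class Node:
    features: frozenset                 # union of features of all samples below
    name: Optional[str] = None          # set only for leaves (individual samples)
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.name is not None


def insert(root: Node, name: str, feats: frozenset, threshold: float = 0.5) -> None:
    """Insert one sample online, without re-clustering existing samples."""
    leaf = Node(feats, name)
    node = root
    while True:
        node.features = node.features | feats
        # Find the most similar child of the current node, if any.
        best_i, best_sim = None, -1.0
        for i, child in enumerate(node.children):
            sim = jaccard(child.features, feats)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is None or best_sim < threshold:
            node.children.append(leaf)  # nothing similar enough: new branch
            return
        best = node.children[best_i]
        if best.is_leaf():
            # Two similar samples form a new family node in place of the leaf.
            node.children[best_i] = Node(best.features | feats, None, [best, leaf])
            return
        node = best                     # descend into the matching family


# Usage: two near-identical byte strings and one unrelated string.
root = Node(frozenset())
insert(root, "a", ngrams(b"AAAABBBBCCCCDDDD"))
insert(root, "b", ngrams(b"AAAABBBBCCCCDDDE"))
insert(root, "c", ngrams(b"zzzzyyyyxxxxwwww"))
```

Each insertion compares the sample only against the children along one root-to-leaf path, which is the property that lets an incremental scheme scale better than recomputing a full hierarchical clustering. Comparing against the union of a family's features is a simplification chosen for brevity; another representation or similarity measure could be substituted.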
