Classy: fast clustering streams of call-graphs

An abstraction resilient to common malware obfuscation techniques is the call-graph. A call-graph is the representation of an executable file as a directed graph with labeled vertices, where the vertices correspond to functions and the edges to function calls. Unfortunately, most of the interesting graph comparison problems, including full-graph comparison and computing the largest common subgraph, are NP-hard. This makes the study and use of graphs in large-scale systems difficult. Existing work has focused only on offline clustering and has not addressed the issue of clustering streams of graphs. In this paper we present Classy, a scalable distributed system that clusters streams of large call-graphs for purposes including automated malware classification and assisting malware analysts. Since algorithms aimed at clustering sets are not suitable for clustering streams of objects, we propose the use of a clustering algorithm that relies on the notion of candidate clusters and reference samples therein. We demonstrate via thorough experimentation that this approach yields results very close to the offline optimal. Graph similarity is determined by computing a graph edit distance (GED) between pairs of graphs using an adapted version of simulated annealing. Furthermore, we present a novel lower bound for the GED. We also study the problem of approximating statistics of clusters of graphs when the distances of only a fraction of all possible pairs have been computed. Finally, we present results and statistics from a real production system that has clustered, and now contains, more than 0.8 million graphs.
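
To make the idea of candidate clusters with reference samples concrete, the following is a minimal sketch and not the Classy algorithm itself. It assumes each call-graph is reduced to a set of labeled edges, that each cluster keeps its first member as its single reference sample, and that a Jaccard distance on edge sets stands in for the graph edit distance. The names `edge_distance`, `Cluster`, `StreamClusterer`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass, field


def edge_distance(a: frozenset, b: frozenset) -> float:
    """Stand-in distance: Jaccard distance between edge sets.

    Assumption: this is NOT the GED computed by simulated annealing;
    it only gives the sketch something cheap to compare graphs with.
    """
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


@dataclass
class Cluster:
    reference: frozenset                      # the cluster's reference sample
    members: list = field(default_factory=list)


class StreamClusterer:
    """Assigns each arriving graph by comparing it only to reference samples."""

    def __init__(self, threshold: float):
        self.threshold = threshold            # hypothetical similarity cutoff
        self.clusters = []

    def add(self, name: str, edges) -> int:
        graph = frozenset(edges)
        best_index, best_dist = -1, float("inf")
        # One distance computation per cluster, not per stored graph.
        for i, cluster in enumerate(self.clusters):
            d = edge_distance(graph, cluster.reference)
            if d < best_dist:
                best_index, best_dist = i, d
        if best_index >= 0 and best_dist <= self.threshold:
            self.clusters[best_index].members.append(name)
            return best_index
        # No reference is close enough: the graph founds a new cluster.
        self.clusters.append(Cluster(reference=graph, members=[name]))
        return len(self.clusters) - 1


clusterer = StreamClusterer(threshold=0.5)
first = clusterer.add("a", {("main", "f"), ("f", "g"), ("g", "h")})
second = clusterer.add("b", {("main", "f"), ("f", "g"), ("g", "x")})
third = clusterer.add("c", {("start", "p"), ("p", "q")})
```

In this sketch, graphs `a` and `b` share two of four distinct edges, so they fall in the same cluster at a threshold of 0.5, while `c` shares no edges with the reference and starts a new one. Comparing against reference samples only is what keeps the per-graph cost proportional to the number of clusters rather than to the number of graphs seen so far.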
