Visualization of shared system call sequence relationships in large malware corpora

We present a novel system for automatically discovering and interactively visualizing shared system call sequence relationships within large malware datasets. The system's pipeline begins by applying a novel heuristic algorithm that extracts variable-length, semantically meaningful system call sequences from malware system call behavior logs. Then, based on which of these semantic sequences occur in each sample, we construct a Boolean vector representation of the malware sample corpus. Finally, we compute pairwise Jaccard indices over the sample vectors to obtain a sample similarity matrix. Our graphical user interface links two visualizations within an interactive display. The first view is a map-like visualization of similarity among the samples, based on a reduced-dimensional projection of the similarity matrix. The second view provides insight into the similarities and differences between selected malware samples in terms of the system call sequences they share. We also provide a set of interactive filters based on malicious behavioral traits. Integrating these views into an interactive, linked display allows users to comprehend the overall similarity structure of a malware corpus, inspect how behavioral traits are distributed over the corpus, and drill in to inspect local similarities and differences between samples.
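The Boolean vector and Jaccard steps of the pipeline can be illustrated with a minimal sketch. It assumes the system call sequences have already been extracted for each sample (the extraction heuristic itself is not reproduced here), and the function names, sample names, and system call names are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def boolean_vectors(sample_sequences):
    """Build one Boolean vector per sample over a shared sequence vocabulary.

    sample_sequences maps a sample name to the set of system call
    sequences (tuples of call names) observed for that sample.
    """
    vocabulary = sorted(set().union(*sample_sequences.values()))
    vectors = {
        name: [seq in seqs for seq in vocabulary]
        for name, seqs in sample_sequences.items()
    }
    return vocabulary, vectors


def jaccard(u, v):
    """Jaccard index of two Boolean vectors: |intersection| / |union|."""
    intersection = sum(a and b for a, b in zip(u, v))
    union = sum(a or b for a, b in zip(u, v))
    # Convention assumed here: two empty vectors have similarity 0.
    return intersection / union if union else 0.0


def similarity_matrix(vectors):
    """Compute the symmetric pairwise Jaccard similarity matrix."""
    names = list(vectors)
    index = {name: i for i, name in enumerate(names)}
    # Diagonal set to 1.0 by convention (each sample matches itself).
    matrix = [[1.0] * len(names) for _ in names]
    for a, b in combinations(names, 2):
        score = jaccard(vectors[a], vectors[b])
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return names, matrix


# Hypothetical toy corpus: three samples and their extracted sequences.
corpus = {
    "sample_a": {("NtCreateFile", "NtWriteFile"), ("NtOpenKey", "NtSetValueKey")},
    "sample_b": {("NtCreateFile", "NtWriteFile"),
                 ("NtOpenProcess", "NtWriteVirtualMemory")},
    "sample_c": {("NtOpenProcess", "NtWriteVirtualMemory")},
}

vocabulary, vectors = boolean_vectors(corpus)
names, matrix = similarity_matrix(vectors)
```

In this toy corpus, sample_a and sample_b share one of three distinct sequences, so their Jaccard index is 1/3, while sample_a and sample_c share none. A matrix of this kind (or the corresponding distances, 1 minus similarity) is what a dimensionality reduction method would take as input to produce the map-like view; the specific projection method is not stated in this abstract.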
