Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

N-grams have long been used as features for classification problems, and their distribution often makes selecting the top-k most frequent n-grams a reliable first pass at feature selection. However, this top-k selection can become a performance bottleneck, especially with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB, we show that Hash-Grams retain comparable classification accuracy while dramatically reducing computational requirements.
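
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each byte n-gram is hashed into a fixed-size table of counters, and the k buckets with the largest counts are kept as features, in contrast to exact top-k selection, which must keep a counter for every distinct n-gram. The function names, the use of CRC32 as the hash, and the table size `num_buckets` are illustrative choices, not details taken from the paper.

```python
import heapq
import zlib
from collections import Counter


def exact_top_k(docs, n, k):
    """Baseline: count every distinct byte n-gram, then keep the k most frequent."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(k)]


def bucket_of(gram, num_buckets):
    """Map an n-gram to a bucket index (CRC32 is an assumed, illustrative hash)."""
    return zlib.crc32(gram) % num_buckets


def hash_gram_top_k(docs, n, k, num_buckets=1 << 16):
    """Hashed variant: count bucket hits in a fixed-size table, keep the top-k buckets.

    Memory is bounded by num_buckets regardless of how many distinct
    n-grams occur; colliding n-grams share one counter.
    """
    table = [0] * num_buckets
    for doc in docs:
        for i in range(len(doc) - n + 1):
            table[bucket_of(doc[i:i + n], num_buckets)] += 1
    return heapq.nlargest(k, range(num_buckets), key=table.__getitem__)


def featurize(doc, n, selected, num_buckets=1 << 16):
    """Binary feature vector: 1 if any n-gram of doc hashes to a selected bucket."""
    column = {bucket: j for j, bucket in enumerate(selected)}
    vec = [0] * len(selected)
    for i in range(len(doc) - n + 1):
        j = column.get(bucket_of(doc[i:i + n], num_buckets))
        if j is not None:
            vec[j] = 1
    return vec


docs = [b"abababab", b"ababab"]
selected = hash_gram_top_k(docs, n=2, k=1)
features = featurize(b"xxab", 2, selected)
```

In this sketch the hashed variant trades exactness for a fixed memory footprint: two n-grams that collide are treated as one feature, which is the kind of approximation the abstract's "comparable classification accuracy" claim would have to tolerate.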
