Paper information - Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
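The abstract contrasts exact top-k n-gram selection with the Hash-Gram approach but does not spell out the mechanism. The sketch below is only an illustration under a stated assumption: that the speedup comes from counting n-grams in a fixed-size table indexed by a hash of the n-gram, then keeping the k most frequent buckets as features. The function names, the CRC32 hash, and the table size are all hypothetical choices made for this example, not details taken from the paper.

```python
import heapq
import zlib
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous byte n-gram in `data`."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def exact_top_k(files, n, k):
    """Baseline: keep one counter per distinct n-gram, then take the top k.

    Memory grows with the number of distinct n-grams in the corpus.
    """
    counts = Counter()
    for data in files:
        counts.update(byte_ngrams(data, n))
    return [gram for gram, _ in counts.most_common(k)]


def bucket_of(gram: bytes, num_buckets: int = 2 ** 16) -> int:
    """Map an n-gram to a table slot (CRC32 is an assumption of this sketch)."""
    return zlib.crc32(gram) % num_buckets


def hashed_top_k(files, n, k, num_buckets: int = 2 ** 16):
    """Hashed variant: count into a fixed-size table, return the top-k bucket ids.

    Memory is bounded by `num_buckets`; colliding n-grams share one count.
    """
    table = [0] * num_buckets
    for data in files:
        for gram in byte_ngrams(data, n):
            table[bucket_of(gram, num_buckets)] += 1
    return heapq.nlargest(k, range(num_buckets), key=table.__getitem__)


corpus = [b"abababab", b"ababcd"]
print(exact_top_k(corpus, n=2, k=2))
print(hashed_top_k(corpus, n=2, k=2))
```

In this sketch the hashed variant returns bucket indices rather than the n-grams themselves, so a feature vector for a new file would be built by hashing its n-grams with the same function and looking up the selected buckets. The trade-off is that distinct n-grams falling into the same bucket become indistinguishable.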

Edward Raff | Charles K. Nicholas
