Malware detection on Windows audit logs using LSTMs

Abstract: Malware is a constant and continuously evolving threat, and security systems struggle to keep up with this change. One challenge is the large volume of logs generated by an operating system and the need to determine which information actually contributes to detecting possible malware. This work aims to detect malware using neural networks based on Windows audit log events. Neural networks can only process continuous data, whereas Windows audit logs are sequential and textual. To address these challenges, we extract features from the audit log events and use LSTMs to capture sequential effects. We create different subsets of features and analyze the effect of additional information. Features describe, for example, the action type of Windows audit log events, process names, or target files that are accessed. Textual features are represented either as one-hot encodings or as embeddings, for which we compare three different approaches to representation learning. The effects of different feature subsets and representations are evaluated on a publicly available data set. Results indicate that using additional information improves the performance of the LSTM model. While the different representations lead to similar classification results, an analysis of the latent space reveals differences more clearly, with FastText appearing to be the most promising representation.
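
As an illustration of the one-hot representation mentioned in the abstract, the following minimal sketch turns a short sequence of audit log events into a sequence of numeric vectors of the kind an LSTM could consume. The event fields (`action`, `process`), their values, and all function names are assumptions made for this example; they are not taken from the paper or from the actual Windows audit log schema. The sketch covers only the encoding step, not the LSTM or the learned embeddings.

```python
from typing import Dict, List

# Hypothetical audit log events; field names and values are illustrative only.
events = [
    {"action": "process_create", "process": "explorer.exe"},
    {"action": "file_write", "process": "winword.exe"},
    {"action": "file_write", "process": "explorer.exe"},
]


def build_vocab(values: List[str]) -> Dict[str, int]:
    """Map each distinct value to an index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_sequence(
    events: List[Dict[str, str]], fields: List[str]
) -> List[List[float]]:
    """Encode each event as the concatenation of one-hot vectors, one per field."""
    vocabs = {f: build_vocab([e[f] for e in events]) for f in fields}
    sequence = []
    for event in events:
        row: List[float] = []
        for f in fields:
            row.extend(one_hot(vocabs[f][event[f]], len(vocabs[f])))
        sequence.append(row)
    return sequence


# Two feature subsets: action type only, and action type plus process name.
seq_action = encode_sequence(events, ["action"])
seq_both = encode_sequence(events, ["action", "process"])
```

Adding the process name widens each time step from two to four dimensions in this toy example, which mirrors the idea of comparing feature subsets: the sequence length stays the same while each event carries more information.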
