From NLP (Natural Language Processing) to MLP (Machine Language Processing)

Natural Language Processing (NLP) in combination with Machine Learning techniques plays an important role in the field of automatic text analysis. Motivated by the successful use of NLP in solving text classification problems in the area of e-Participation and inspired by our prior work in the field of polymorphic shellcode detection we gave classical NLP-processes a trial in the special case of malicious code analysis. Any malicious program is based on some kind of machine language, ranging from manually crafted assembler code that exploits a buffer overflow to high level languages such as Javascript used in web-based attacks. We argue that well known NLP analysis processes can be modified and applied to the malware analysis domain. Similar to the NLP process we call this process Machine Language Processing (MLP). In this paper, we use our e-Participation analysis architecture, extract the various NLP techniques and adopt them for the malware analysis process. As proof-of-concept we apply the adopted framework to malicious code examples from Metasploit.

[1]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[2]  Computer Network Security , 2005 .

[3]  P. N. Suganthan,et al.  Robust growing neural gas algorithm with application in cluster analysis , 2004, Neural Networks.

[4]  H. Chertkow,et al.  Semantic memory , 2002, Current neurology and neuroscience reports.

[5]  Zhenkai Liang,et al.  BitBlaze: A New Approach to Computer Security via Binary Analysis , 2008, ICISS.

[6]  Dan Klein,et al.  Fast Exact Inference with a Factored Model for Natural Language Parsing , 2002, NIPS.

[7]  Peter Parycek,et al.  Automated Analysis of e-Participation Data by Utilizing Associative Networks, Spreading Activation and Unsupervised Learning , 2009, ePart.

[8]  Amit Vasudevan,et al.  Cobra: fine-grained malware analysis using stealth localized-executions , 2006, 2006 IEEE Symposium on Security and Privacy (S&P'06).

[9]  Hideki Kozima,et al.  Similarity between Words Computed by Spreading Activation on an English Dictionary , 1993, EACL.

[10]  Fabio Crestani,et al.  Application of Spreading Activation Techniques in Information Retrieval , 1997, Artificial Intelligence Review.

[11]  Andrew Zisserman,et al.  Advances in Neural Information Processing Systems (NIPS) , 2007 .

[12]  Udo Payer,et al.  Hybrid Engine for Polymorphic Shellcode Detection , 2005, DIMVA.

[13]  Michalis Vazirgiannis,et al.  Word Sense Disambiguation with Spreading Activation Networks Generated from Thesauri , 2007, IJCAI.

[14]  Stefan Kraxberger,et al.  Massive Data Mining for Polymorphic Code Detection , 2005, MMM-ACNS.