Automatic protocol feature word construction based on machine learning

Automatic protocol reverse engineering for application protocol is becoming more and more important for many applications such as application protocol analyzer, penetration testing, intrusion prevention and detection. Unfortunately, many techniques for extracting the protocol message format specifications of unknown applications often have some limitations for few priori information or the time-consuming problem. Protocol feature words are byte subsequences within traffic payload that could help distinguish application protocols. In this paper, a new approach is proposed for extracting the protocol message format specifications of unknown applications which is based on the Latent Dirichlet Allocation (LDA) model and Huffman Tree Support Vector Machine (HT-SVM). Firstly, the key words are extracted by utilizing the LDA model, which is a kind of machine learning in document library to extract the theme structure named topic. Secondly, the HT-SVM method is applied to constructing the feature words based on the above process. The proposed approach is implemented and evaluated to infer message format specifications of SMTP binary protocol. Experimental results show that the approach accurately parses and infers SMTP protocol with highly recall rate.

[1]  Christopher Krügel,et al.  Prospex: Protocol Specification Extraction , 2009, 2009 30th IEEE Symposium on Security and Privacy.

[2]  Antonio Pescapè,et al.  Issues and future directions in traffic classification , 2012, IEEE Network.

[3]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[4]  Du Hongle Network intrusion detection based on heterogeneous distance , 2014 .

[5]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[6]  Judith Kelner,et al.  A Survey on Internet Traffic Identification , 2009, IEEE Communications Surveys & Tutorials.

[7]  Jan van Leeuwen,et al.  On the Construction of Huffman Trees , 1976, ICALP.

[8]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[9]  Dawn Xiaodong Song,et al.  Inference and analysis of formal models of botnet command and control protocols , 2010, CCS '10.

[10]  Sandro Etalle,et al.  N-Gram against the Machine: On the Feasibility of the N-Gram Network Analysis for Binary Protocols , 2012, RAID.

[11]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[12]  Helen J. Wang,et al.  Discoverer: Automatic Protocol Reverse Engineering from Network Traces , 2007, USENIX Security Symposium.

[13]  Zhenkai Liang,et al.  Polyglot: automatic extraction of protocol message format using dynamic binary analysis , 2007, CCS '07.

[14]  Xiu-yu Zhong,et al.  The Research And Application of Web Log Mining Based on the Platform Weka , 2011 .

[15]  Thomas W. Reps,et al.  Extracting Output Formats from Executables , 2006, 2006 13th Working Conference on Reverse Engineering.