Application-Oblivious L7 Parsing Using Recurrent Neural Networks

Extracting fields from layer-7 (L7) protocols such as HTTP, known as L7 parsing, is key to many critical network applications. Existing L7 parsing techniques, however, are built around protocol specifications, incurring substantial human effort to specify data formats as well as high computational/memory costs that scale poorly with the rapidly growing number of L7 protocols. To this end, this paper introduces a new framework named content-based L7 parsing, in which the content, rather than the format, becomes the first-class citizen. Under this framework, users only need to label the content they are interested in, and the parser learns an extraction model from the users' labeling behavior. Since the parser is specification-independent, both the human effort and the computational/memory costs can be dramatically reduced. To realize content-based L7 parsing, we propose REPLAY, which builds on recurrent neural networks (RNNs) and addresses a series of technical challenges such as large labeling overhead and slow parsing speed. We prototype REPLAY on GPUs and show that it achieves a precision of 98% and a recall of 97%, with a throughput as high as 12 Gbps, for diverse extraction tasks.
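The abstract does not include code, but the general idea it describes, learning to extract user-labeled content from raw protocol bytes with an RNN, can be framed as per-byte sequence labeling. The sketch below illustrates that framing under stated assumptions: the `ByteTagger` name, the dimensions, and the choice of PyTorch with a bidirectional GRU are all illustrative, not REPLAY's actual implementation.

```python
import torch
import torch.nn as nn

class ByteTagger(nn.Module):
    """Illustrative sketch: tag each byte of an L7 message as part of the
    user-labeled field (1) or not (0), with no protocol specification."""
    def __init__(self, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(256, embed_dim)      # one entry per byte value
        self.rnn = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden_dim, 2)   # in-field vs. out-of-field

    def forward(self, byte_ids):                       # (batch, seq_len) int64
        h, _ = self.rnn(self.embed(byte_ids))          # (batch, seq_len, 2*hidden)
        return self.classify(h)                        # per-byte logits

# Hypothetical usage: locate the value of an HTTP Host header. In practice
# the model would be trained with cross-entropy against user-labeled masks.
msg = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
x = torch.tensor([list(msg)], dtype=torch.long)
model = ByteTagger()
pred = model(x).argmax(-1)                             # predicted in-field mask
```

A bidirectional RNN lets each byte's tag depend on both the preceding and the following context, so the model can locate, say, a header value without a hand-written grammar; training then reduces to standard supervised learning over the labeled byte masks.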
