Finding Clustering Configurations to Accurately Infer Packet Structures from Network Data

Clustering is often used for reverse engineering network protocols from captured network traces. The performance of clustering techniques is often contingent upon the selection of various parameters, which can have a severe impact on clustering quality. In this paper we experimentally investigate the effect of four different parameters on the clustering of network traces. We also determine the optimal parameter configuration for traces from four different network protocols. Our results indicate that the choice of distance measure and the length of the message have the most substantial impact on cluster accuracy. Depending on the type of protocol, the $n$-gram length can also have a substantial impact.
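To make the parameters named above concrete, the following is a minimal sketch of how the message length, the $n$-gram length, and the distance measure could enter a pairwise distance computation that a clustering algorithm would then consume. It is not the paper's implementation: the function names, the default values, and the use of the Jaccard distance over byte $n$-grams are assumptions chosen purely for illustration.

```python
from typing import List, Sequence, Set


def ngram_set(message: bytes, n: int, max_len: int) -> Set[bytes]:
    """Return the distinct byte n-grams in the first max_len bytes of a message.

    max_len stands in for the "message length" parameter and n for the
    n-gram length; both names are hypothetical.
    """
    prefix = message[:max_len]
    return {prefix[i:i + n] for i in range(len(prefix) - n + 1)}


def jaccard_distance(a: Set[bytes], b: Set[bytes]) -> float:
    """Jaccard distance 1 - |A & B| / |A | B|, one possible distance measure."""
    union = a | b
    if not union:
        # Two messages with no n-grams at all are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(messages: Sequence[bytes], n: int = 2,
                    max_len: int = 32) -> List[List[float]]:
    """Pairwise distances between messages, ready to pass to a clusterer."""
    profiles = [ngram_set(m, n, max_len) for m in messages]
    return [[jaccard_distance(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    # Made-up example messages: two textual requests and one binary header.
    trace = [
        b"GET /index.html HTTP/1.1",
        b"GET /about.html HTTP/1.1",
        b"\x00\x01\x00\x00\x00\x01",
    ]
    for row in distance_matrix(trace, n=2, max_len=16):
        print(["%.2f" % d for d in row])
```

Changing `n`, `max_len`, or swapping `jaccard_distance` for another measure changes the resulting distance matrix, and therefore which messages end up grouped together, which is the kind of sensitivity the experiments examine.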
