Unsupervised traffic classification using flow statistical properties and IP packet payload

In network traffic classification, "unknown applications" remain a difficult, unsolved problem. Conventional supervised methods assign every traffic flow to one of a set of predefined classes, so they cannot handle applications for which no labelled training data exist. Unsupervised clustering algorithms such as k-means have been applied to group traffic flows automatically, but they produce a large number of clusters that do not map cleanly onto the small number of real applications. To address the problem of unknown applications, we propose a novel unsupervised approach that discovers application-based traffic classes and classifies traffic flows according to the applications that generated them. In the training stage, flow statistical properties and IP packet payload are used in combination to discover traffic classes. Clusters are first constructed from flow statistical features; we then introduce a bag-of-words (BoW) model to represent the payload content of each cluster, and apply latent semantic analysis (LSA) to aggregate clusters whose payload content is similar. In the testing stage, only flow statistical features are used to classify traffic flows, which protects user privacy and allows known encrypted applications to be handled without inspecting IP packets. A number of experiments are carried out on a real-world traffic dataset to demonstrate the effectiveness and robustness of the proposed approach.

Highlights
- We proposed a novel unsupervised approach for network traffic classification.
- We introduced a bag-of-words model to represent the content of traffic clusters.
- We applied latent semantic analysis to aggregate similar traffic clusters.
- For training, we combined flow statistical properties and IP packet payload.
- For testing, only flow statistical properties were used for classification.
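The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how payload "words" are defined, how many clusters are used, or how similar clusters are merged. The sketch assumes 3-byte payload n-grams as words, a plain k-means with farthest-first initialisation, an LSA rank of 2, and a cosine-similarity threshold of 0.8 with transitive merging. All function names and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialisation (assumed)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def cluster_bow(payloads, labels, k, n=3):
    """Cluster-by-word count matrix; a 'word' is an n-byte payload substring (assumed)."""
    counts = [Counter() for _ in range(k)]
    for payload, c in zip(payloads, labels):
        counts[c].update(payload[i:i + n] for i in range(len(payload) - n + 1))
    vocab = sorted(set().union(*counts))
    return np.array([[cnt[w] for w in vocab] for cnt in counts], dtype=float)

def lsa_merge(M, rank=2, threshold=0.8):
    """Project clusters into an LSA space via SVD and merge similar ones."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    Z = U[:, :rank] * S[:rank]
    Z = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    sim = Z @ Z.T  # cosine similarity between clusters
    parent = list(range(len(M)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(M)):
        for j in range(i + 1, len(M)):
            if sim[i, j] > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(M))]  # class id for each cluster

def classify(flow, centroids, cluster_class):
    """Testing stage: nearest centroid on flow statistics only, no payload."""
    flow = np.asarray(flow, dtype=float)
    nearest = np.linalg.norm(centroids - flow, axis=1).argmin()
    return cluster_class[nearest]

# Toy training data: two statistical modes of one application plus a second
# application. Features are (mean packet size, duration); values are made up.
X = np.array([[100 + i, 1 + 0.1 * i] for i in range(5)]
             + [[1400 + i, 30 + 0.1 * i] for i in range(5)]
             + [[600 + i, 300 + 0.1 * i] for i in range(5)], dtype=float)
payloads = ([f"GET /page{i} HTTP/1.1".encode() for i in range(5)] * 2
            + [b"SSH-2.0-OpenSSH_8.9"] * 5)

centroids, labels = kmeans(X, k=3)
cluster_class = lsa_merge(cluster_bow(payloads, labels, k=3))
print(cluster_class)
print(classify([1401.0, 30.0], centroids, cluster_class))
```

In this toy setup the two clusters that carry the same HTTP-like payload end up in one class even though their flow statistics differ, while the SSH-like cluster stays separate. New flows are then classified from their statistical features alone, as in the testing stage described above.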
