Method of data cleaning for network traffic classification

Abstract Network traffic classification aims at identifying the application types of network packets. It is important for Internet service providers (ISPs) to manage bandwidth resources and ensure the quality of service for different network applications. However, most classification techniques using machine learning only focus on high flow accuracy and ignore byte accuracy. The classifier would obtain low classification performance for elephant flows as the imbalance between elephant flows and mice flows on Internet. The elephant flows, however, consume much more bandwidth than mice flows. When the classifier is deployed for traffic policing, the network management system cannot penalize elephant flows and avoid network congestion effectively. This article explores the factors related to low byte accuracy, and secondly, it presents a new traffic classification method to improve byte accuracy at the aid of data cleaning. Experiments are carried out on three groups of real-world traffic datasets, and the method is compared with existing work on the performance of improving byte accuracy. Experiment shows that byte accuracy increased by about 22.31% on average. The method outperforms the existing one in most cases.

[1]  Riyad Alshammari,et al.  Can encrypted traffic be identified without port numbers, IP addresses and payload inspection? , 2011, Comput. Networks.

[2]  Ece Guran Schmidt,et al.  Machine learning algorithms for accurate flow-based network traffic classification: Evaluation and comparison , 2010, Perform. Evaluation.

[3]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[4]  Renata Teixeira,et al.  Traffic classification on the fly , 2006, CCRV.

[5]  Antonio Pescapè,et al.  Issues and future directions in traffic classification , 2012, IEEE Network.

[6]  Maurizio Dusi,et al.  Using GMM and SVM-Based Techniques for the Classification of SSH-Encrypted Traffic , 2009, 2009 IEEE International Conference on Communications.

[7]  Niccolo Cascarano,et al.  GT: picking up the truth from the ground for internet traffic , 2009, CCRV.

[8]  Judith Kelner,et al.  A Survey on Internet Traffic Identification , 2009, IEEE Communications Surveys & Tutorials.

[9]  Marco Canini,et al.  Efficient application identification and the temporal and spatial stability of classification schema , 2009, Comput. Networks.

[10]  Zhi-Li Zhang,et al.  A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks , 2012, TKDD.

[11]  Sebastian Zander,et al.  A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification , 2006, CCRV.

[12]  Sakir Sezer,et al.  Host-Based P2P Flow Identification and Use in Real-Time , 2011, TWEB.

[13]  Yanghee Choi,et al.  NeTraMark: a network traffic classification benchmark , 2011, CCRV.

[14]  Xiaohong Guan,et al.  An SVM-based machine learning method for accurate internet traffic classification , 2010, Inf. Syst. Frontiers.

[15]  Haitao He,et al.  Improve Flow Accuracy and Byte Accuracy in Network Traffic Classification , 2008, ICIC.

[16]  Jun Zhang,et al.  An Effective Network Traffic Classification Method with Unknown Flow Detection , 2013, IEEE Transactions on Network and Service Management.

[17]  Misha Denil,et al.  Overlap versus Imbalance , 2010, Canadian Conference on AI.