Internet Traffic Classification Using Constrained Clustering

Statistics-based Internet traffic classification using machine learning techniques has attracted extensive research interest lately, because of the increasing ineffectiveness of traditional port-based and payload-based approaches. In particular, unsupervised learning, that is, traffic clustering, is very important in real-life applications, where labeled training data are difficult to obtain and new patterns keep emerging. Although previous studies have applied some classic clustering algorithms such as K-Means and EM for the task, the quality of resultant traffic clusters was far from satisfactory. In order to improve the accuracy of traffic clustering, we propose a constrained clustering scheme that makes decisions with consideration of some background information in addition to the observed traffic statistics. Specifically, we make use of equivalence set constraints indicating that particular sets of flows are using the same application layer protocols, which can be efficiently inferred from packet headers according to the background knowledge of TCP/IP networking. We model the observed data and constraints using Gaussian mixture density and adapt an approximate algorithm for the maximum likelihood estimation of model parameters. Moreover, we study the effects of unsupervised feature discretization on traffic clustering by using a fundamental binning method. A number of real-world Internet traffic traces have been used in our evaluation, and the results show that the proposed approach not only improves the quality of traffic clusters in terms of overall accuracy and per-class metrics, but also speeds up the convergence.

[1]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[2]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[3]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[4]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[5]  Tomer Hertz,et al.  Computing Gaussian Mixture Models with EM Using Equivalence Constraints , 2003, NIPS.

[6]  Andrew B. Nobel,et al.  Statistical Clustering of Internet Communication Patterns , 2003 .

[7]  Anthony McGregor,et al.  Flow Clustering Using Machine Learning Techniques , 2004, PAM.

[8]  Matthew Roughan,et al.  Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification , 2004, IMC '04.

[9]  Sebastian Zander,et al.  Automated traffic classification and application identification using machine learning , 2005, The IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05)l.

[10]  Andrew W. Moore,et al.  Internet traffic classification using bayesian analysis techniques , 2005, SIGMETRICS '05.

[11]  Jeffrey Erman,et al.  Internet Traffic Identification using Machine Learning , 2006 .

[12]  Sebastian Zander,et al.  A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification , 2006, CCRV.

[13]  Ian Davidson,et al.  When Is Constrained Clustering Beneficial, and Why? , 2006, AAAI.

[14]  Renata Teixeira,et al.  Traffic classification on the fly , 2006, CCRV.

[15]  Anirban Mahanti,et al.  Traffic classification using clustering algorithms , 2006, MineNet '06.

[16]  Andrew W. Moore,et al.  Bayesian Neural Networks for Internet Traffic Classification , 2007, IEEE Transactions on Neural Networks.

[17]  Renata Teixeira,et al.  Early Recognition of Encrypted Applications , 2007, PAM.

[18]  Carey L. Williamson,et al.  Categories and Subject Descriptors: C.4 [Computer Systems Organization]Performance of Systems , 2022 .

[19]  Maurizio Dusi,et al.  Traffic classification through simple statistical fingerprinting , 2007, CCRV.

[20]  Grenville J. Armitage,et al.  A survey of techniques for internet traffic classification using machine learning , 2008, IEEE Communications Surveys & Tutorials.

[21]  Luca Salgarelli,et al.  On the stability of the information carried by traffic flow features at the packet level , 2009, CCRV.

[22]  Shunzheng Yu,et al.  Supervised Learning Real-time Traffic Classifiers , 2009, J. Networks.

[23]  Guillaume Urvoy-Keller,et al.  Challenging statistical classification for operational usage: the ADSL case , 2009, IMC '09.

[24]  Luca Salgarelli,et al.  Support Vector Machines for TCP traffic classification , 2009, Comput. Networks.

[25]  Wolfgang Mühlbauer,et al.  Digging into HTTPS: flow-based classification of webmail traffic , 2010, IMC '10.

[26]  Yanghee Choi,et al.  Internet traffic classification demystified: on the sources of the discriminative power , 2010, CoNEXT.

[27]  Sebastian Zander,et al.  Practical machine learning based multimedia traffic classification for distributed QoS management , 2011, 2011 IEEE 36th Conference on Local Computer Networks.

[28]  Jun Zhang,et al.  A novel semi-supervised approach for network traffic clustering , 2011, 2011 5th International Conference on Network and System Security.

[29]  Sebastian Zander,et al.  Timely and Continuous Machine-Learning-Based Classification for Interactive IP Traffic , 2012, IEEE/ACM Transactions on Networking.

[30]  Jun Zhang,et al.  Network Traffic Classification Using Correlation Information , 2013, IEEE Transactions on Parallel and Distributed Systems.