Internet traffic clustering with constraints

Due to the limitations of the traditional port-based and payload-based traffic classification approaches, the past decade has seen extensive work on utilizing machine learning techniques to classify network traffic based on packet and flow level features. In particular, previous studies have shown that the unsupervised clustering approach is both accurate and capable of discovering previously unknown application classes. In this paper, we explore the utility of side information in the process of traffic clustering. Specifically, we focus on the flow correlation information that can be efficiently extracted from packet headers and expressed as instance-level constraints, which indicate that particular sets of flows are using the same application and thus should be put into the same cluster. To incorporate the constraints, we propose a modified constrained K-Means algorithm. A variety of real-world traffic traces are used to show that the constraints are widely available. The experimental results indicate that the constrained approach not only improves the quality of the resulted clusters, but also speeds up the convergence of the clustering process.

[1]  Grenville J. Armitage,et al.  A survey of techniques for internet traffic classification using machine learning , 2008, IEEE Communications Surveys & Tutorials.

[2]  Michalis Faloutsos,et al.  Internet traffic classification demystified: myths, caveats, and the best practices , 2008, CoNEXT '08.

[3]  Anirban Mahanti,et al.  Traffic classification using clustering algorithms , 2006, MineNet '06.

[4]  Maurizio Dusi,et al.  Traffic classification through simple statistical fingerprinting , 2007, CCRV.

[5]  Andrew W. Moore,et al.  Internet traffic classification using bayesian analysis techniques , 2005, SIGMETRICS '05.

[6]  Guillaume Urvoy-Keller,et al.  Challenging statistical classification for operational usage: the ADSL case , 2009, IMC '09.

[7]  Sebastian Zander,et al.  Automated traffic classification and application identification using machine learning , 2005, The IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05)l.

[8]  Matthew Roughan,et al.  Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification , 2004, IMC '04.

[9]  Renata Teixeira,et al.  Traffic classification on the fly , 2006, CCRV.

[10]  Andrew W. Moore,et al.  Bayesian Neural Networks for Internet Traffic Classification , 2007, IEEE Transactions on Neural Networks.

[11]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[12]  Jason Lee,et al.  A first look at modern enterprise traffic , 2005, IMC '05.

[13]  Carey L. Williamson,et al.  Categories and Subject Descriptors: C.4 [Computer Systems Organization]Performance of Systems , 2022 .

[14]  Grenville J. Armitage,et al.  Training on multiple sub-flows to optimise the use of Machine Learning classifiers in real-world IP networks , 2006, Proceedings. 2006 31st IEEE Conference on Local Computer Networks.

[15]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[16]  Yanghee Choi,et al.  Internet traffic classification demystified: on the sources of the discriminative power , 2010, CoNEXT.

[17]  Anthony McGregor,et al.  Flow Clustering Using Machine Learning Techniques , 2004, PAM.

[18]  Marco Canini,et al.  Experience with high-speed automated application-identification for network-management , 2009, ANCS '09.

[19]  Stefan Savage,et al.  Unexpected means of protocol inference , 2006, IMC '06.

[20]  Michalis Faloutsos,et al.  BLINC: multilevel traffic classification in the dark , 2005, SIGCOMM '05.

[21]  Ian Davidson,et al.  When Is Constrained Clustering Beneficial, and Why? , 2006, AAAI.

[22]  Jeffrey Erman,et al.  Internet Traffic Identification using Machine Learning , 2006 .

[23]  Wolfgang Mühlbauer,et al.  Digging into HTTPS: flow-based classification of webmail traffic , 2010, IMC '10.

[24]  Jun Zhang,et al.  A novel semi-supervised approach for network traffic clustering , 2011, 2011 5th International Conference on Network and System Security.

[25]  Dan Pelleg,et al.  K -Means with Large and Noisy Constraint Sets , 2007, ECML.