A novel semi-supervised approach for network traffic clustering

Network traffic classification is an essential component of network management and security systems. To address the limitations of traditional port-based and payload-based methods, recent studies have focused on alternative approaches. One promising direction is to apply machine learning techniques to classify traffic flows based on packet-level and flow-level statistics. In particular, previous work has shown that clustering can achieve high accuracy and discover unknown application classes. In this work, we present a novel semi-supervised learning method based on constrained clustering algorithms. The motivation is that, in the network domain, a great deal of background information is available in addition to the data instances themselves. For example, we might know that flows ƒ1 and ƒ2 use the same application protocol because they visit the same host address at the same port simultaneously. In this case, ƒ1 and ƒ2 should ideally be grouped into the same cluster. We therefore describe these correlations in the form of pairwise must-link constraints and incorporate them into the clustering process. We apply three constrained variants of the K-Means algorithm, which perform hard or soft constraint satisfaction and metric learning from constraints. A number of real-world traffic traces are used to show the availability of constraints and to test the proposed approach. The experimental results indicate that incorporating constraints into the clustering process can significantly improve overall accuracy and cluster purity.
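The idea of must-link constraints can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each flow is a dictionary with hypothetical keys `dst_ip`, `dst_port`, `proto`, and `start`, and it interprets "simultaneously" as start times within a hypothetical `window` of seconds. The functions `must_link_pairs`, `must_link_groups`, and `assign_groups` are illustrative names. The last one shows only the assignment step of a hard-constrained K-Means variant: every must-link group is assigned as a unit to the centroid with the lowest total squared distance.

```python
from collections import defaultdict
from itertools import combinations


def must_link_pairs(flows, window=1.0):
    """Pair flows that reach the same (dst_ip, dst_port, proto) endpoint
    with start times at most `window` seconds apart (assumed criterion)."""
    by_endpoint = defaultdict(list)
    for idx, flow in enumerate(flows):
        key = (flow["dst_ip"], flow["dst_port"], flow["proto"])
        by_endpoint[key].append(idx)
    pairs = []
    for members in by_endpoint.values():
        for i, j in combinations(members, 2):
            if abs(flows[i]["start"] - flows[j]["start"]) <= window:
                pairs.append((i, j))
    return pairs


def must_link_groups(n, pairs):
    """Transitive closure of must-link pairs via union-find.
    Returns a sorted list of index groups covering all n flows."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def assign_groups(points, groups, centroids):
    """Hard must-link assignment step: each group goes, as a whole,
    to the centroid minimising its summed squared Euclidean distance."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    labels = [None] * len(points)
    for members in groups:
        best = min(
            range(len(centroids)),
            key=lambda k: sum(sq_dist(points[m], centroids[k]) for m in members),
        )
        for m in members:
            labels[m] = best
    return labels
```

In a full constrained K-Means loop, this assignment step would alternate with the usual centroid update until the labels stop changing. Soft-constraint and metric-learning variants replace the hard group assignment with a penalty term or a learned distance, which this sketch does not cover.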
