Internet Traffic Identification using Machine Learning

We apply an unsupervised machine learning approach to Internet traffic identification and compare the results with those of a previously applied supervised machine learning approach. Our unsupervised approach uses an Expectation Maximization (EM) based clustering algorithm, and the supervised approach uses the Naïve Bayes classifier. We find that the unsupervised clustering technique has an accuracy of up to 91% and outperforms the supervised technique by up to 9%. We also find that the unsupervised technique can be used to discover traffic from previously unknown applications and has the potential to become an excellent tool for exploring Internet traffic.

I. INTRODUCTION

Accurate classification of Internet traffic is important in many areas such as network design, network management, and network security. One key challenge in this area is adapting to the dynamic nature of Internet traffic. New applications are increasingly being deployed on the Internet, and some, such as peer-to-peer (P2P) file sharing and online gaming, are becoming popular. As Internet traffic evolves in both the number and the type of applications, traditional classification techniques, such as those based on well-known port numbers or packet payload analysis, are either no longer effective for all types of network traffic or cannot be deployed because of privacy or security concerns for the data.

A promising approach that has recently received some attention is traffic classification using machine learning techniques (1)-(4). These approaches assume that applications typically send data in some sort of pattern; these patterns can be used as a means of identification, allowing connections to be classified by traffic class. To find these patterns, flow statistics (such as mean packet size, flow length, and total number of packets) available using only TCP/IP headers are needed.
This allows the classification technique to avoid the use of port numbers and packet payload information in the classification process.

In this paper, we apply an unsupervised learning technique (EM clustering) to the Internet traffic classification problem and compare the results with those of a previously applied supervised machine learning approach. The unsupervised clustering approach uses an Expectation Maximization (EM) algorithm (5), which differs from the supervised approach in that it groups unlabeled training data into "clusters" based on similarity. The Naïve Bayes classifier has previously been shown to have high accuracy for Internet traffic classification (2). In parallel work, Zander et al. focus on using the EM clustering approach to build the classification model (4). We complement their work by using the EM clustering approach to build a classifier and show that this classifier outperforms the Naïve Bayes classifier in terms of classification accuracy. We also analyze the time required to build the classification models for both approaches as a function of the size of the training data set. Finally, we explore the clusters found by the EM approach and find that the majority of the connections fall in a subset of the total clusters.

The rest of this paper is organized as follows. Section II presents related work. Section III covers the background on the algorithms used in the Naïve Bayes and EM clustering approaches. In Section IV, we introduce the data sets used in our work and present our experimental results. Section V discusses the advantages and disadvantages of the approaches. Section VI presents our conclusions and describes avenues for future work.
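
To make the header-only flow statistics and the clustering idea from the introduction concrete, the following is a minimal Python sketch and not the implementation evaluated in this paper. It rests on several assumptions: the packet tuple layout and the names `flow_features` and `em_cluster` are hypothetical, flows are keyed by a unidirectional 5-tuple, "flow length" is interpreted as flow duration, and EM is fitted to a Gaussian mixture with diagonal covariance using a deterministic farthest-point initialization.

```python
import math
from collections import defaultdict


def flow_features(packets):
    """Compute per-flow statistics from header fields only.

    Each packet is a hypothetical tuple:
    (src_ip, dst_ip, src_port, dst_port, proto, timestamp_s, size_bytes).
    """
    flows = defaultdict(list)
    for src, dst, sport, dport, proto, ts, size in packets:
        flows[(src, dst, sport, dport, proto)].append((ts, size))
    features = {}
    for key, pkts in flows.items():
        times = [t for t, _ in pkts]
        sizes = [s for _, s in pkts]
        features[key] = (
            sum(sizes) / len(sizes),   # mean packet size (bytes)
            max(times) - min(times),   # flow duration (seconds), an assumption
            float(len(pkts)),          # total number of packets
        )
    return features


def em_cluster(points, k, iters=50):
    """Fit a diagonal-covariance Gaussian mixture with EM; return hard labels."""
    n, dim = len(points), len(points[0])

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Deterministic farthest-point initialization of the k means.
    means = [list(points[0])]
    while len(means) < k:
        far = max(points, key=lambda p: min(sqdist(p, m) for m in means))
        means.append(list(far))

    # Every component starts from the overall per-dimension variance.
    centre = [sum(p[d] for p in points) / n for d in range(dim)]
    spread = [max(sum((p[d] - centre[d]) ** 2 for p in points) / n, 1e-6)
              for d in range(dim)]
    variances = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k

    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point (log domain).
        resp = []
        for p in points:
            logp = []
            for j in range(k):
                ll = math.log(weights[j])
                for d in range(dim):
                    v = variances[j][d]
                    ll -= 0.5 * (math.log(2 * math.pi * v)
                                 + (p[d] - means[j][d]) ** 2 / v)
                logp.append(ll)
            top = max(logp)
            exps = [math.exp(l - top) for l in logp]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty clusters
            weights[j] = nj / n
            for d in range(dim):
                mu = sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                var = sum(r[j] * (p[d] - mu) ** 2
                          for r, p in zip(resp, points)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, 1e-6)  # variance floor

    # Hard assignment from the responsibilities of the final E-step.
    return [max(range(k), key=lambda j: r[j]) for r in resp]
```

A caller would pass `list(flow_features(packets).values())` to `em_cluster` with a chosen number of clusters. Because the three features have very different scales, some rescaling before clustering would likely be needed in practice; that step is omitted here.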

[1]  Sebastian Zander,et al.  Automated traffic classification and application identification using machine learning , 2005, The IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05).

[2]  Anthony McGregor,et al.  Flow Clustering Using Machine Learning Techniques , 2004, PAM.

[3]  Oliver Spatscheck,et al.  Accurate, scalable in-network identification of p2p traffic using application signatures , 2004, WWW '04.

[4]  Michalis Faloutsos,et al.  Transport layer identification of P2P traffic , 2004, IMC '04.

[5]  Sebastian Zander,et al.  Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics , 2005, PAM.

[6]  Konstantina Papagiannaki,et al.  Toward the Accurate Identification of Network Applications , 2005, PAM.

[7]  Anirban Mahanti,et al.  Traffic classification using clustering algorithms , 2006, MineNet '06.

[8]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[9]  Vern Paxson,et al.  Empirically derived analytic models of wide-area TCP connections , 1994, TNET.

[10]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[11]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[12]  Ian Witten,et al.  Data Mining , 2000 .

[13]  Michalis Faloutsos,et al.  BLINC: multilevel traffic classification in the dark , 2005, SIGCOMM '05.

[14]  Andrew W. Moore,et al.  Internet traffic classification using bayesian analysis techniques , 2005, SIGMETRICS '05.

[15]  Patrick Haffner,et al.  ACAS: automated construction of application signatures , 2005, MineNet '05.

[16]  Matthew Roughan,et al.  Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification , 2004, IMC '04.