Evaluating machine learning algorithms for automated network application identification

The identification of network applications that create traffic flows is vital to the areas of network management and surveillance. Current popular methods such as port number and payload-based identification are inadequate and exhibit a number of shortfalls. A potential solution is the use of machine learning techniques to identify network applications based on payload-independent statistical features. In this paper we evaluate and compare the efficiency and performance of different feature selection and machine learning techniques based on flow data obtained from a number of public traffic traces. We also provide insights into which flow features are the most useful. Furthermore, we investigate the influence of other factors such as flow timeout and size of the training data set. We find significant performance differences between different algorithms and identify several algorithms that provide accurate (up to 99% accuracy) and fast classification.

Keywords: Traffic Classification, Machine Learning, Statistical Features

I. INTRODUCTION

There is a growing need for accurate and timely classification of network traffic flows for purposes such as trend analyses (estimating capacity demand trends for network planning), adaptive, network-based QoS marking of traffic, dynamic access control (adaptive firewalls that detect forbidden applications or attacks) or lawful interception. 'Classification' refers to the identification of an application or group of applications responsible for a traffic flow.

Port-based classification is still widely practiced despite being only moderately accurate at best. It is expected to become less effective in the near future due to an ever-increasing number of network applications, extensive use of network address translation (NAT), dynamic port allocation and end-users deliberately choosing non-default ports. For example, a large amount of peer-to-peer (p2p) file sharing traffic is found on non-default ports (1).
Alternative solutions such as payload-based classification rely on specific application data (protocol decoding or signatures), making it difficult to

[1] D. Kibler, et al. Instance-based learning algorithms, 2004, Machine Learning.

[2] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[3] Andrew W. Moore, et al. Internet traffic classification using bayesian analysis techniques, 2005, SIGMETRICS '05.

[4] Matthew Roughan, et al. Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification, 2004, IMC '04.

[5] Michalis Faloutsos, et al. BLINC: multilevel traffic classification in the dark, 2005, SIGCOMM '05.

[6] J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, 1998.

[7] Geoff Holmes, et al. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, 2003, IEEE Trans. Knowl. Data Eng.

[8] David E. Goldberg, et al. Genetic Algorithms in Search Optimization and Machine Learning, 1988.

[9] Pat Langley, et al. Estimating Continuous Distributions in Bayesian Classifiers, 1995, UAI.

[10] Mark A. Hall, et al. Correlation-based Feature Selection for Machine Learning, 2003.

[11] Ron Kohavi, et al. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid, 1996, KDD.

[12] D. E. Goldberg, et al. Genetic Algorithms in Search, 1989.

[13] Gavin C. Cawley. Efficient Sequential Minimal Optimisation of Support Vector Classifiers, 2001.

[14] Yoav Freund, et al. Experiments with a New Boosting Algorithm, 1996, ICML.

[15] Remco R. Bouckaert, et al. Bayesian network classifiers in Weka, 2004.

[16] Anthony McGregor, et al. Flow Clustering Using Machine Learning Techniques, 2004, PAM.

[17] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[18] Sebastian Zander, et al. Automated traffic classification and application identification using machine learning, 2005, The IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05).

[19] R. Quinlan, et al. Decision tree discovery, 1999.

[20] Michalis Faloutsos, et al. Is P2P dying or just hiding? [P2P traffic measurement], 2004, IEEE Global Telecommunications Conference, 2004. GLOBECOM '04.

[21] Huan Liu, et al. Consistency-based search in feature selection, 2003, Artif. Intell.