Comprehensive Analysis of Network Traffic Data

Given the large volume of network traffic flows, raw data must be preprocessed before classification to obtain accurate results quickly. Feature selection is an essential step in the preprocessing phase, and Principal Component Analysis (PCA) is recognized as an effective and efficient method for it. In this paper, we classify network traffic using PCA together with six machine learning algorithms: Naive Bayes, Decision Tree, 1-Nearest Neighbor (NN), Random Forest, Support Vector Machine (SVM) and H2O. We analyze the impact of PCA by classifying the data set with each algorithm, both with and without PCA. The experiments vary the size of the input data sets, and performance is measured with two metrics: overall accuracy and F-measure. Computational time is also considered in the analysis. Our results show that Random Forest and NN are the top two algorithms among the six. Specifically, both classify well for most input set sizes, whether or not PCA is applied. Lastly, PCA significantly improves the classification accuracy of the NN algorithm and shortens the classification time of Random Forest.
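The two evaluation metrics named above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the traffic class labels, and the choice to report F-measure per class with an unweighted (macro) average are all assumptions, since the abstract does not say how F-measure is aggregated.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of flows whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_measure_per_class(y_true, y_pred):
    """Per-class F-measure, the harmonic mean of precision and recall.

    Uses the equivalent count form F = 2*TP / (2*TP + FP + FN).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the flow was not p
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


# Hypothetical labels for six flows (class names are illustrative only).
y_true = ["WWW", "WWW", "WWW", "MAIL", "MAIL", "P2P"]
y_pred = ["WWW", "WWW", "MAIL", "MAIL", "P2P", "P2P"]

accuracy = overall_accuracy(y_true, y_pred)
per_class = f_measure_per_class(y_true, y_pred)
# Macro average: every class counts equally, regardless of its size.
macro_f = sum(per_class.values()) / len(per_class)

print(f"overall accuracy: {accuracy:.3f}")
print(f"macro F-measure:  {macro_f:.3f}")
```

Reporting both numbers matters when classes are imbalanced: overall accuracy can stay high while a rare traffic class is classified poorly, and a per-class F-measure exposes that.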
