An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification

Abstract Substantial recent effort has gone into applying Machine Learning (ML) techniques to flow statistical features for traffic classification. However, the classification performance of ML techniques is severely degraded by the high dimensionality and redundancy of flow statistical features, the imbalance in the number of traffic flows across classes, and the concept drift of Internet traffic. To address these problems together, this paper proposes a new feature optimization approach based on deep learning and Feature Selection (FS) techniques that provides optimal and robust features for traffic classification. First, symmetric uncertainty is used to remove irrelevant features from network traffic data sets. Second, a feature generation model based on deep learning is applied to the remaining relevant features for dimensionality reduction and feature generation. Finally, Weighted Symmetric Uncertainty (WSU) is used to select the optimal features by removing redundant ones. Experimental results on real traffic traces show that the proposed approach not only reduces the dimension of the feature space efficiently, but also overcomes the negative impact of multi-class imbalance and concept drift on ML techniques. Furthermore, compared with the approaches used in previous work, the proposed approach achieves the best classification performance and relatively high runtime performance.
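
The first stage of the pipeline, relevance filtering by symmetric uncertainty, can be illustrated with a minimal sketch. It uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) and assumes the features have already been discretized. The feature names, the toy data, and the 0.1 threshold are hypothetical choices for illustration, not values from the paper, and the sketch covers neither the deep-learning feature generation model nor the WSU redundancy step.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so nothing is shared
    joint_entropy = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint_entropy
    return 2.0 * mutual_info / (hx + hy)


def select_relevant(features, labels, threshold=0.1):
    """Return names of features whose SU with the class label exceeds threshold.

    The threshold value is an assumption made for this sketch.
    """
    return [name for name, column in features.items()
            if symmetric_uncertainty(column, labels) > threshold]


# Hypothetical toy data: eight flows, two traffic classes.
labels = ["web", "web", "web", "web", "p2p", "p2p", "p2p", "p2p"]
features = {
    "port_bucket": [0, 0, 0, 0, 1, 1, 1, 1],  # fully determined by the class
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class
}

su_port = symmetric_uncertainty(features["port_bucket"], labels)
su_noise = symmetric_uncertainty(features["noise"], labels)
relevant = select_relevant(features, labels)
```

In this toy case the feature that tracks the class label scores an SU of 1 and is kept, while the independent feature scores 0 and is dropped.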
