Active Learning for Network Traffic Classification: A Technical Study

[Note: This work has been submitted to the IEEE Transactions on Cognitive Communications and Networking journal for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.]

Abstract—Network Traffic Classification (NTC) has become an important component in a wide variety of network management operations, e.g., Quality of Service (QoS) provisioning and security. Machine Learning (ML) algorithms, a common approach to NTC, can achieve reasonable accuracy and handle encrypted traffic. However, ML-based NTC techniques suffer from a shortage of labeled traffic data, which is the case in many real-world applications. This study investigates the applicability of an active form of ML, called Active Learning (AL), which reduces the need for a large number of labeled examples by actively choosing the instances that should be labeled. The study first provides an overview of NTC and its fundamental challenges and surveys the literature on ML techniques in NTC. Then, it introduces the concepts of AL, discusses it in the context of NTC, and reviews the literature in this field. Further, challenges and open issues in the use of AL for NTC are discussed. Additionally, as a technical survey, some experiments are conducted to show the broad applicability of AL in NTC. The simulation results show that AL can achieve high accuracy with a small amount of data.
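The idea of "actively choosing the instances that should be labeled" can be illustrated with a minimal pool-based sketch. This is not the study's experimental setup: the two synthetic flow features, the nearest-centroid classifier, the margin-based query rule, and all function names below are assumptions chosen only to keep the example self-contained. The learner starts with one labeled flow per class and, at each round, asks the oracle to label the pool flow whose two nearest class centroids are almost equally distant.

```python
import random

def make_flows(n_per_class, rng):
    """Synthetic two-feature flows (hypothetical: scaled packet size, inter-arrival time)."""
    flows, labels = [], []
    for label, (cx, cy) in enumerate([(0.3, 0.3), (0.7, 0.7)]):
        for _ in range(n_per_class):
            flows.append((rng.gauss(cx, 0.12), rng.gauss(cy, 0.12)))
            labels.append(label)
    return flows, labels

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroids(flows, labeled):
    """Nearest-centroid model: one mean feature vector per labeled class."""
    groups = {}
    for i, y in labeled.items():
        groups.setdefault(y, []).append(flows[i])
    return {y: tuple(sum(c) / len(pts) for c in zip(*pts)) for y, pts in groups.items()}

def predict(model, x):
    return min(model, key=lambda y: sq_dist(model[y], x))

def margin(model, x):
    """Gap between the two closest centroids; a small gap means an uncertain flow."""
    d = sorted(sq_dist(c, x) for c in model.values())
    return d[1] - d[0]

def accuracy(model, flows, labels):
    return sum(predict(model, x) == y for x, y in zip(flows, labels)) / len(flows)

def active_learn(flows, oracle, seed_idx, budget):
    """Query the oracle `budget` times, always for the most uncertain pool flow."""
    labeled = {i: oracle[i] for i in seed_idx}   # seed must cover both classes
    pool = set(range(len(flows))) - set(seed_idx)
    for _ in range(budget):
        model = fit_centroids(flows, labeled)
        query = min(sorted(pool), key=lambda i: margin(model, flows[i]))
        labeled[query] = oracle[query]           # the oracle stands in for a human annotator
        pool.remove(query)
    return fit_centroids(flows, labeled), labeled

rng = random.Random(0)
flows, oracle = make_flows(100, rng)
model, labeled = active_learn(flows, oracle, seed_idx=[0, 100], budget=10)
print(f"labels used: {len(labeled)} of {len(flows)}")
print(f"accuracy on all flows: {accuracy(model, flows, oracle):.2f}")
```

Only the queried flows consume labeling effort; the remaining pool stays unlabeled. A practical NTC pipeline would likely replace the toy classifier and query rule with the models and strategies surveyed in the study.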
