A Survey on Big Data for Network Traffic Monitoring and Analysis

Network Traffic Monitoring and Analysis (NTMA) is a key component of network management, especially for guaranteeing the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, designing scalable NTMA applications becomes difficult. Applications such as traffic classification and policing require real-time and scalable approaches. Anomaly detection and security mechanisms must quickly identify and react to unpredictable events while processing millions of heterogeneous events. Lastly, the system has to collect, store, and process massive sets of historical data for post-mortem analysis. These are precisely the challenges faced by general big data approaches: Volume, Velocity, Variety, and Veracity. This survey brings together NTMA and big data. We catalog previous work on NTMA that adopts big data approaches to understand to what extent the potential of big data is being explored in NTMA. The survey focuses mainly on approaches and technologies for managing big NTMA data, and briefly discusses big data analytics (e.g., machine learning) for NTMA. Finally, we provide guidelines for future work, discussing lessons learned and research directions.