Data mining and machine learning methods for sustainable smart cities traffic classification: A survey

Abstract This survey reviews the literature on Sustainable Smart Cities (SSC), Machine Learning (ML), Data Mining (DM), datasets, and feature extraction and selection for network traffic classification. Methods and feature datasets were identified on the basis of relevance and citation count, then read and summarized. Because data and data features are essential to Internet traffic classification with machine learning techniques, some well-known and widely used datasets are described together with their detailed statistical features. Different classification techniques for SSC network traffic classification are presented in detail. The complexity of datasets, feature extraction, and machine learning methods is addressed. Finally, challenges and recommendations for SSC network traffic classification with feature datasets are presented.
