Towards Evaluating Quality of Datasets for Network Traffic Domain

This paper deals with the quality of network traffic datasets created to train and validate machine learning classification and detection methods. There is a long history of research on data quality; however, it focuses mainly on data consistency, validity, precision, and similar metrics, which are insufficient for network traffic use cases. The growing use of machine learning in network monitoring applications requires a new methodology for evaluating datasets. There is a need to evaluate and compare traffic samples captured under different conditions and to decide whether already captured and annotated data remain usable. This paper aims to explain a use case of dataset creation, propose definitions regarding the quality of network traffic datasets, and finally describe a framework for dataset analysis.
