Towards Provable Network Traffic Measurement and Analysis via Semi-Labeled Trace Datasets

Network traffic measurement and analysis is a long-standing research field with growing interest from both academia and industry. Yet even after many years, replication, critique, and review of published results remain rare. The field lacks not only research standards but also accessible datasets suitable for developing and evaluating methods. As a result, much potentially high-quality research cannot be verified and is not adopted by industry or the community. This paper aims to address this problem with a solution that combines distinct approaches proposed in other research works. Unlike those studies, we consider the issue as a whole, covering data anonymization, authenticity, recency, and public availability, as well as the use of such data for research provability. We believe these challenges can be addressed by using semi-labeled datasets, which combine real-world network traffic with annotated units containing only interest-related packet traces. In this paper, we outline the basic ideas of the methodology, from unit trace collection and semi-labeled dataset creation to the use of the dataset for research evaluation. We intend this proposal to start a discussion of the approach and to help overcome some of the challenges that research in this field faces today.
