NapierOne: A modern mixed file data set alternative to Govdocs1

A review of the ransomware detection research literature found that almost no proposal provided enough detail on how its test data set was created, or a sufficient description of its actual content, to allow other researchers to recreate the environment and validate the reported results. A modern cybersecurity mixed file data set called NapierOne is presented, primarily aimed at, but not limited to, ransomware detection and forensic analysis research. NapierOne was designed to address this deficiency in reproducibility and improve consistency by facilitating research replication and repeatability. The methodology used in the creation of this data set is also described in detail. The data set was inspired by the Govdocs1 data set, and it is intended that NapierOne be used as a complement to this original data set. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed, which involved combining the findings from multiple sources of file type usage into an overall ranked list. Subsequently, 5,000 real-world example files were gathered, and a specific data subset created, for each of the common file types identified. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic of that file type. For example, there are multiple data subsets for the ZIP file type, with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present. The resulting data set comprises 100 separate data subsets divided between 44 distinct file types, resulting in almost 500,000 unique files in total.
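The consensus approach mentioned above, combining several sources of file type usage into one ranked list, could be sketched as follows. This is only an illustration: the abstract does not state which aggregation rule was used, so a simple Borda count is assumed here, and the function name `consensus_ranking` and the sample source lists are hypothetical.

```python
from collections import defaultdict


def consensus_ranking(source_rankings):
    """Merge several ranked lists (most common first) into one list.

    Assumption: a Borda count is used, where an item at position p in a
    list of length n earns n - p points. The abstract does not specify
    the actual aggregation rule.
    """
    scores = defaultdict(int)
    for ranking in source_rankings:
        n = len(ranking)
        for position, file_type in enumerate(ranking):
            scores[file_type] += n - position
    # Highest score first; ties are broken alphabetically so the
    # result is deterministic.
    return sorted(scores, key=lambda ft: (-scores[ft], ft))


# Hypothetical usage rankings from three independent sources.
sources = [
    ["pdf", "docx", "jpg", "zip"],
    ["jpg", "pdf", "png"],
    ["pdf", "jpg", "docx"],
]
print(consensus_ranking(sources))
```

A file type that appears near the top of many sources accumulates the most points, which is one plausible way to express "consensus" when no single authoritative usage survey exists.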
A description of the techniques used to gather the files for each file type is provided, together with the actions that were performed on the files to confirm that they were of the highest quality and provided an accurate representation of their specific file type. Details are also provided on the content of the entire data set, as well as instructions on how researchers can gain free and unlimited access to the final data set. While the data set was initially created to aid research in ransomware detection, it is sufficiently broad and diverse to be applied in many other areas of research that require a varied mixture of common real-world file examples. The NapierOne data set is an ongoing project, and researchers are strongly encouraged to use it in their own research.
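The abstract notes that ransomware execution tends to produce high-entropy files. As a minimal sketch of what that measurement means, the snippet below computes the Shannon entropy of a byte sequence in bits per byte (0 for constant data, 8 for perfectly uniform bytes). The function name `shannon_entropy` is an illustrative choice and is not taken from the NapierOne tooling.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Constant data carries no information per byte.
low = b"A" * 1024
# Every byte value equally often: the maximum possible entropy.
uniform = bytes(range(256)) * 4

print(shannon_entropy(low))
print(shannon_entropy(uniform))
```

Encrypted and well-compressed files both score close to 8 bits per byte, which is why a data set for ransomware research benefits from including file types that are naturally high in entropy.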
