Data for Digital Forensics: Why a Discussion on “How Realistic is Synthetic Data” is Dispensable

Digital forensics depends on data sets for various purposes, such as concept evaluation, educational training, and tool validation. Researchers have gathered such data sets into repositories and created data simulation frameworks for producing large amounts of data. Synthetic data often faces skepticism because it is perceived to deviate from real-world data, which raises doubts about its realism. This paper addresses this concern, arguing that there is no definitive answer. We focus on four common digital forensic use cases that rely on data. Through these, we elucidate the specifications and prerequisites of data sets within their respective contexts. Our discussion shows that both real-world and synthetic data are indispensable for advancing digital forensic science, software, tools, and the competence of practitioners. Additionally, we provide an overview of available data set repositories and data generation frameworks, contributing to the ongoing dialogue on the utility of digital forensic data sets.
