Towards Syntactic Approximate Matching - A Pre-Processing Experiment

Over the past few years, the popularity of approximate matching algorithms (a.k.a. fuzzy hashing) has increased. Especially within the area of bytewise approximate matching, several algorithms have been published, tested, and improved. These algorithms have been shown to be powerful; however, they are sometimes too precise for real-world investigations. That is, even very small commonalities (e.g., in the header of a file) can cause a match. While this is a desired property, it may also lead to unwanted results. In this paper we show that simple pre-processing can significantly influence the outcome. Although our test set consists of text-based file types (because they are easy to process), this technique can be used for other well-documented types as well. Our results show that, depending on the use case, it can be beneficial to focus on the content of files only. Additionally, we present a small, self-created dataset that can be used in the future to evaluate approximate matching algorithms, since it is labeled (we know which files are similar and how).
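The effect described above can be illustrated with a minimal sketch. It is not the pre-processing used in this paper: it assumes email-style text files whose header is separated from the content by a blank line, and it uses a Jaccard similarity over character n-grams as a crude stand-in for a real approximate matching algorithm. The helper names `strip_headers`, `ngrams`, and `similarity`, and both sample documents, are hypothetical.

```python
from email import message_from_string


def strip_headers(raw: str) -> str:
    """Return only the body of an email-style text message (assumed format)."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch only handles plain text.
    return payload if isinstance(payload, str) else ""


def ngrams(text: str, n: int = 5) -> set:
    """Set of overlapping character n-grams (stand-in for a similarity digest)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


# Two made-up files that share a header but have unrelated content.
HEADER = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Weekly status\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
)
doc_a = HEADER + "The quarterly budget review is moved to Thursday afternoon.\n"
doc_b = HEADER + "Please remember to water the plants while I am on holiday.\n"

raw_score = similarity(doc_a, doc_b)
body_score = similarity(strip_headers(doc_a), strip_headers(doc_b))
print(f"with headers: {raw_score:.2f}")
print(f"content only: {body_score:.2f}")
```

In this toy case the shared header alone produces a noticeable score, while the content-only comparison drops to near zero. Whether stripping headers is desirable depends on the use case, as noted above.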
