Automated evaluation of approximate matching algorithms on real data

Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to screen and analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions to detect data objects with bytewise identical representation with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching ones are still in their early development stages, and evaluation methodology is still evolving. Broadly, prior approaches have used either a human in the loop to manually evaluate the goodness of similarity matches on real-world data, or controlled (pseudo-random) data to perform automated evaluation. This work's contribution is to introduce automated approximate matching evaluation on real data by relating approximate matching results to the longest common substring (LCS). Specifically, we introduce a computationally efficient LCS approximation and use it to obtain ground truth on the t5 set. Using the results, we evaluate three existing approximate matching schemes relative to LCS and analyze their performance. © 2014 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).
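To make the ground-truth notion concrete, the following is a minimal sketch of an exact longest common substring computation over two byte strings. It is not the computationally efficient LCS approximation that the abstract refers to, whose details are not given here; it is the textbook dynamic-programming method, which runs in time proportional to the product of the input lengths and is therefore only practical for small inputs. The function name `longest_common_substring` is an illustrative choice.

```python
def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Return one longest run of bytes that appears contiguously in both inputs.

    Textbook O(len(a) * len(b)) dynamic programming; an illustrative
    baseline, not the approximation described in the abstract.
    """
    # Keep the shorter input as `b` so each DP row is as small as possible.
    if len(a) < len(b):
        a, b = b, a

    prev = [0] * (len(b) + 1)  # prev[j]: length of common suffix ending at a[i-2], b[j-1]
    best_len = 0               # length of the best match found so far
    best_end = 0               # exclusive end index of that match within b

    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix that ended one byte earlier in both inputs.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = j
        prev = curr

    return b[best_end - best_len:best_end]


if __name__ == "__main__":
    x = b"xxABCDEFyy"
    y = b"zzABCDEFww"
    print(longest_common_substring(x, y))
```

The length of the returned substring, taken in absolute terms or relative to the size of the smaller object, is one way to express how much contiguous content two objects share, which is the kind of quantity an approximate matching score can then be compared against.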
