The rapid growth in the average size of digital forensic targets demands new automated means to correlate digital artifacts quickly, accurately, and reliably. Such tools need to offer more flexibility than routine known-file filtering based on cryptographic hashes. Currently, there are two tools for which NIST has produced reference hash sets: ssdeep and sdhash. The former provides a fixed-size fuzzy hash based on random polynomials, whereas the latter produces a variable-length similarity digest based on statistically identified features packed into Bloom filters. This study provides a baseline evaluation of the capabilities of these tools, both in a controlled environment and on real-world data. The results show that the similarity digest approach significantly outperforms the fuzzy hash approach in terms of recall and precision in all tested scenarios, and that it demonstrates robust and scalable behavior.