An evaluation of forensic similarity hashes

The fast growth of the average size of digital forensic targets demands new automated means to quickly, accurately and reliably correlate digital artifacts. Such tools need to offer more flexibility than the routine known-file filtering based on crypto hashes. Currently, there are two tools for which NIST has produced reference hash sets-ssdeep and sdhash. The former provides a fixed-sized fuzzy hash based on random polynomials, whereas the latter produces a variable-length similarity digest based on statistically-identified features packed into Bloom filters. This study provides a baseline evaluation of the capabilities of these tools both in a controlled environment and on real-world data. The results show that the similarity digest approach significantly outperforms in terms of recall and precision in all tested scenarios and demonstrates robust and scalable behavior.