Cryptographic hashes such as MD5 and SHA-1 are used in many data mining and security applications, typically as identifiers for files and documents. However, changing a single byte of a file produces a completely different cryptographic hash value. It would be very useful to have hashes whose values indicate when files are similar. The security field has proposed similarity digests, and the data mining community has proposed locality sensitive hashes. Proposals include the Nilsimsa hash (a locality sensitive hash) and Ssdeep and Sdhash (both similarity digests). Here, we describe a new locality sensitive hashing scheme, TLSH. We provide algorithms for evaluating and comparing hash values, and provide a reference to its open source code. We perform an empirical evaluation of publicly available similarity digest schemes. The evaluation highlights significant problems with previously proposed schemes; the TLSH scheme does not suffer from the flaws identified.
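The sensitivity of cryptographic hashes to small changes can be illustrated with a minimal sketch using Python's standard hashlib module. The sample inputs and the helper name hamming_bits are assumptions chosen for illustration; this shows only the cryptographic-hash behaviour that motivates similarity digests, not the TLSH scheme itself.

```python
import hashlib


def hamming_bits(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two inputs that differ in exactly one byte ("dog" vs "cog").
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

digest_a = hashlib.sha1(original).digest()
digest_b = hashlib.sha1(modified).digest()

# The digests share no useful structure, so the hash values alone
# give no hint that the inputs are nearly identical.
differing = hamming_bits(digest_a, digest_b)
print(hashlib.sha1(original).hexdigest())
print(hashlib.sha1(modified).hexdigest())
print(f"{differing} of {len(digest_a) * 8} digest bits differ")
```

A locality sensitive hash or similarity digest aims for the opposite property: inputs that differ slightly should yield hash values that are measurably close.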