TLSH -- A Locality Sensitive Hash

Cryptographic hashes such as MD5 and SHA-1 are used for many data mining and security applications -- they are used as an identifier for files and documents. However, if a single byte of a file is changed, then cryptographic hashes result in a completely different hash value. It would be very useful to work with hashes which identify that files were similar based on their hash values. The security field has proposed similarity digests, and the data mining community has proposed locality sensitive hashes. Some proposals include the Nilsimsa hash (a locality sensitive hash), Ssdeep and Sdhash (both Ssdeep and Sdhash are similarity digests). Here, we describe a new locality sensitive hashing scheme the TLSH. We provide algorithms for evaluating and comparing hash values and provide a reference to its open source code. We do an empirical evaluation of publically available similarity digest schemes. The empirical evaluation highlights significant problems with previously proposed schemes; the TLSH scheme does not suffer from the flaws identified.