Research and design of similar file forensics system based on fuzzy hash

The collection and identification of digital evidence is an essential procedure in file forensics, and it typically relies on manual retrieval, traditional hash techniques, keyword queries, and similar methods. Because electronic documents are fragile and easily changed or tampered with, finding files similar to a target file is important in forensic work. However, traditional forensic systems usually rely on keyword search or on scanning entire files, and neither approach is fast or accurate enough to support today's forensic tasks. Because the fuzzy hash algorithm is of great value for calculating the similarity between files, in this paper we analyze the procedure of the fuzzy hash algorithm and an improvement to it, and we verify the accuracy and efficiency of the improved algorithm. We innovatively apply fuzzy hash technology to the field of file forensics and design a more adaptable and more accurate file forensics system. The system follows the process of acquiring the storage media, collecting the evidence files, and preserving the evidence files, and it combines text mining, data recovery, text clustering, classification, and other technologies. We believe that this system is a breakthrough on existing problems in the file forensics field, such as heavy manual workload and low accuracy.
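To make the idea of a similarity score between files concrete, the following is a minimal sketch of a fuzzy (context-triggered piecewise) hash. It is not the improved algorithm analyzed in this paper. The names `fuzzy_signature` and `similarity`, the window and block sizes, the FNV-1a chunk hash, and the use of Python's `difflib.SequenceMatcher` in place of a weighted edit distance are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7        # bytes in the rolling window (assumed value)
BLOCK_SIZE = 64   # target average chunk length in bytes (assumed value)
FNV_OFFSET = 0x811C9DC5
FNV_PRIME = 0x01000193


def fuzzy_signature(data: bytes) -> str:
    """Cut data at content-defined boundaries and emit one character per chunk."""
    signature = []
    chunk_hash = FNV_OFFSET
    for i, byte in enumerate(data):
        # FNV-1a hash of the current chunk (32-bit).
        chunk_hash = ((chunk_hash ^ byte) * FNV_PRIME) & 0xFFFFFFFF
        # The trigger value depends only on the last WINDOW bytes, so an edit
        # elsewhere in the file does not move this boundary.
        window_sum = sum(data[max(0, i - WINDOW + 1): i + 1])
        if window_sum % BLOCK_SIZE == BLOCK_SIZE - 1:
            signature.append(ALPHABET[chunk_hash % 64])
            chunk_hash = FNV_OFFSET  # start a new chunk
    signature.append(ALPHABET[chunk_hash % 64])  # trailing chunk
    return "".join(signature)


def similarity(a: bytes, b: bytes) -> int:
    """Return a 0-100 score comparing the two signatures (illustrative metric)."""
    matcher = SequenceMatcher(None, fuzzy_signature(a), fuzzy_signature(b),
                              autojunk=False)
    return round(100 * matcher.ratio())
```

In this sketch, a small local edit changes only the signature characters of the chunks it touches, so a tampered copy still scores close to the original, whereas a traditional cryptographic hash of the two files would differ completely.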