Real Time Network File Similarity Detection Based on Approximate Matching

Real-time packet inspection has become a hot topic because it is needed in many applications, such as spam and virus detection, intrusion and attack detection, and the collection of statistics. For efficiency, most traditional techniques rely on exact matching against keywords and/or MD5 whitelists and blacklists. However, it is well known that exact matching may fail to identify similar files, such as the same video posted by different users with small changes (e.g. to the title) or metamorphic viruses (mutated computer viruses). Approximate matching (e.g. fuzzy hashing), by contrast, is known to be more robust at identifying similar files and has proven effective in digital forensics. In this paper, we aim to confirm that, with an appropriate approximate matching approach, it is feasible and effective to inspect real-time traffic in order to identify similar files. Our experiments with real data show that our solution achieves good usability in practice. In particular, in a typical file detection scenario, we obtained an algorithm throughput of over 46 MB/s.
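
To illustrate the contrast between exact and approximate matching drawn above, the following is a minimal sketch, not the algorithm evaluated in this paper. It assumes a simple byte n-gram Jaccard similarity as a stand-in for a fuzzy hash; the helper names (md5_hex, ngram_set, jaccard_similarity, make_payload) and the synthetic payloads are hypothetical and chosen only for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Exact-match fingerprint, as used in MD5 whitelists and blacklists."""
    return hashlib.md5(data).hexdigest()


def ngram_set(data: bytes, n: int = 8) -> set:
    """All overlapping n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_similarity(a: bytes, b: bytes, n: int = 8) -> float:
    """Illustrative similarity score in [0, 1]; a stand-in for a fuzzy hash."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def make_payload(seed: int, blocks: int = 128) -> bytes:
    """Deterministic pseudo-random bytes standing in for file content (assumption)."""
    return b"".join(
        hashlib.sha256(seed.to_bytes(4, "big") + i.to_bytes(4, "big")).digest()
        for i in range(blocks)
    )


original = make_payload(seed=1)
# Same content with only the first 16 bytes changed, e.g. an edited title.
modified = b"A DIFFERENT NAME" + original[16:]
unrelated = make_payload(seed=2)

# The exact fingerprints differ even though almost all bytes are shared.
print("MD5 equal:", md5_hex(original) == md5_hex(modified))
# The approximate score stays high for the near-duplicate and low otherwise.
print("similar pair:", round(jaccard_similarity(original, modified), 3))
print("unrelated pair:", round(jaccard_similarity(original, unrelated), 3))
```

In this sketch a small edit changes the MD5 digest completely, while the n-gram overlap barely moves. Practical approximate matching schemes replace the full n-gram set with a compact digest so that comparison can keep up with network traffic.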