Class-Aware Similarity Hashing for Data Classification

This paper introduces “class-aware similarity hashes” or “classprints,” which are an outgrowth of recent work on similarity hashing. The approach builds on the notion of context-based hashing to create a framework for identifying data types based on content and for building characteristic similarity hashes for individual data items that can be used for correlation. The principal benefits are that data classification can be fully automated and that a priori knowledge of the underlying data is not necessary beyond the availability of a suitable training set.