Dynamic Similarity for Fields with NULL Values

One of the most important tasks in data cleansing is de-duplicating records, which requires comparing records to determine whether they are equivalent. However, existing comparison methods, such as Record Similarity and Equational Theory, implicitly assume that the values in all fields are known and treat NULL values as empty strings, which causes true duplicate records to be missed. In this paper, we address this problem by proposing a simple yet efficient method, Dynamic Similarity, which dynamically adjusts the similarity for fields with NULL values. Performance results on real and synthetic datasets show that, compared with Record Similarity, Dynamic Similarity identifies more correct duplicate records without introducing more false positives. Furthermore, the percentage of correct duplicate records found by Dynamic Similarity but not by Record Similarity increases as the number of fields with NULL values increases.
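
To illustrate the contrast drawn above, the following is a minimal sketch and not the method as defined in the paper. It assumes that record similarity is a weighted average of per-field string similarities, and that "dynamically adjusting" means redistributing the weight of any field that is NULL in either record over the remaining known fields. The function names, the weights, and the use of Python's `difflib.SequenceMatcher` as the field comparator are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

Record = Sequence[Optional[str]]  # None stands for a NULL field value


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; an assumed stand-in for any field comparator."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Baseline: NULL is treated as an empty string, so a NULL field
    compared with a known value contributes a similarity of 0."""
    total = sum(
        w * field_similarity(a or "", b or "")
        for a, b, w in zip(r1, r2, weights)
    )
    return total / sum(weights)


def dynamic_similarity(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Assumed adjustment: skip fields that are NULL in either record and
    renormalize the weights over the fields known in both records."""
    known = [
        (a, b, w)
        for a, b, w in zip(r1, r2, weights)
        if a is not None and b is not None
    ]
    if not known:
        return 0.0  # no evidence either way; hypothetical choice
    total_weight = sum(w for _, _, w in known)
    return sum(w * field_similarity(a, b) for a, b, w in known) / total_weight
```

Under these assumptions, two records that agree on name and address but differ only in that one has a NULL phone number score 2/3 with the baseline and 1.0 with the adjusted measure (equal weights), so a duplicate threshold of, say, 0.8 would miss the pair in the first case and catch it in the second.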