CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data

The SIGMOD Programming Contest 2020 poses a real-world entity resolution problem: identifying product specifications from multiple e-commerce websites that represent the same real-world camera. Entity resolution has been studied extensively, and the general solution framework consists of two phases: blocking and matching. Most existing work focuses on the matching phase, which trains (complex) models on large volumes of data and uses them to decide whether a pair of descriptions refers to the same real-world object. However, training a high-quality model is difficult for the SIGMOD contest because only a limited amount of labeled data is available and the product specifications can be dirty and incomplete. In this paper, we propose CheetahER, an accurate and efficient entity resolution system. Unlike existing work, we focus on improving the effectiveness of the blocking phase, which is overlooked in both academic research and industry systems, and propose a two-phase blocking framework that groups the product specifications by brand and model. The pre-processing and data cleaning procedures are also carefully designed to improve data quality. CheetahER ranked first in accuracy among 53 teams and completed the task within 20 seconds. Although some design choices in CheetahER are specific to camera datasets, its novel two-phase blocking framework and operators (i.e., merging and splitting) may generalize to other entity resolution tasks.
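
As an illustration of the idea of blocking by brand and then by model, the following minimal Python sketch groups specification titles first into brand blocks and then into model blocks within each brand. It is not the CheetahER implementation: the brand vocabulary `BRANDS`, the model pattern `MODEL_RE`, and the function names are hypothetical choices made for this example, and the merging and splitting operators are not shown.

```python
import re
from collections import defaultdict

# Hypothetical brand vocabulary (an assumption for this sketch only).
BRANDS = ("canon", "nikon", "sony")

# Hypothetical model pattern: letters followed by digits, e.g. "d3200" or "a7".
MODEL_RE = re.compile(r"\b([a-z]+\d+[a-z]*)\b")


def extract_brand(title):
    """Return the first known brand found in the title, or None."""
    lowered = title.lower()
    for brand in BRANDS:
        if brand in lowered:
            return brand
    return None


def extract_model(title):
    """Return the first model-like token in the title, or None."""
    match = MODEL_RE.search(title.lower())
    return match.group(1) if match else None


def two_phase_blocking(specs):
    """Group specification ids by (brand, model).

    specs maps a specification id to its title string. Specifications
    with no recognizable brand or model are skipped in this sketch.
    """
    # Phase 1: block by brand.
    by_brand = defaultdict(list)
    for spec_id, title in specs.items():
        brand = extract_brand(title)
        if brand is not None:
            by_brand[brand].append(spec_id)

    # Phase 2: within each brand block, block by model.
    blocks = defaultdict(list)
    for brand, spec_ids in by_brand.items():
        for spec_id in spec_ids:
            model = extract_model(specs[spec_id])
            if model is not None:
                blocks[(brand, model)].append(spec_id)
    return dict(blocks)


specs = {
    "s1": "Nikon D3200 24MP Digital SLR",
    "s2": "nikon d3200 kit with 18-55mm lens",
    "s3": "Sony A7 full frame",
    "s4": "Generic camera bag",
}
blocks = two_phase_blocking(specs)
```

In this sketch, every pair of specifications that falls into the same (brand, model) block is treated as a candidate match, so no pairwise model has to be trained on the limited labeled data.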