A Comparison of Fast Blocking Methods for Record Linkage

The linkage of millions of individual health records for ethically approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms. We compare two new blocking methods, bigram indexing and canopy clustering with TFIDF (Term Frequency/Inverse Document Frequency), with two older methods, standard (traditional) blocking and sorted neighbourhood blocking. The results show that recent blocking methods such as bigram indexing and canopy clustering scale well while maintaining or improving upon record linkage accuracy. These new blocking methods therefore have the potential to deliver both large performance speed-ups and better accuracy.
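To illustrate how blocking reduces the number of candidate comparison pairs, the following is a minimal sketch of standard blocking on a toy data set. It is not the implementation evaluated in this paper: the record layout, the `blocking_key` function (first two letters of the surname), and the sample names are all assumptions chosen for illustration, and the sketch deduplicates a single file rather than linking two.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first two letters of the surname, lowercased.
    return record["surname"][:2].lower()


def candidate_pairs(records):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    # Only records that share a block are compared with each other.
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


# Illustrative records only; not data from the paper.
records = [
    {"surname": "Smith"},
    {"surname": "Smyth"},
    {"surname": "Jones"},
    {"surname": "Johnson"},
    {"surname": "Miller"},
]

# Without blocking, every record is compared with every other record.
all_pairs = len(records) * (len(records) - 1) // 2
blocked = candidate_pairs(records)

print(f"pairs without blocking: {all_pairs}")
print(f"pairs with blocking:    {len(blocked)}")
```

In this toy case the comparison space shrinks from all pairs to only those pairs whose records share a key. The trade-off is that a true match whose key differs (for example, a typo in the first two letters) is never compared, which is the kind of accuracy loss that the newer methods named above aim to reduce.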