Secure pseudonymisation for privacy-preserving probabilistic record linkage

Abstract. Record linkage is becoming an increasingly important tool in many areas of research, particularly medical research, where the relevant data often reside in more than one location. In the absence of a reliable and unique identifier, probabilistic approaches to linkage are often employed. Such linkage generally exploits the information contained in the fields that are common to a record pair. In classical record linkage, the values in common fields are simply compared for equality. Because values might contain typographical (or other) errors, the performance of classical record linkage can often be significantly improved if similarities between value pairs are also exploited. In applications where the data used for matching must be kept private, the raw values are replaced by pseudonyms. For better linkage performance, these pseudonyms should also convey information about similarities. Existing approaches are often based on Bloom filters, yet these are susceptible to attack, and secure schemes based on Bloom filters inevitably involve additional security measures. Here we introduce a new scheme that produces pseudonyms that are far more secure than Bloom filters. It can be used as a drop-in replacement in many schemes that use Bloom filters. The new scheme allows similarity scores to be estimated from pairs of pseudonyms with negligible bias and with a known variance for a given similarity score.
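
To make the Bloom-filter baseline mentioned above concrete, the following is a minimal sketch of how such pseudonyms can convey similarity. It is not the scheme introduced in this paper. It assumes padded character bigrams as the features, a keyed hash (HMAC-SHA256) to choose bit positions, and the Dice coefficient as the similarity score; the function names, the key, and the parameters `m` and `k` are hypothetical choices made only for illustration.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a string into padded character bigrams (an assumed feature choice)."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_pseudonym(value, key, m=1000, k=10):
    """Return the set of bit positions set in an m-bit Bloom filter.

    Each bigram is hashed k times with a keyed hash; m and k are
    illustrative values, not recommendations.
    """
    bits = set()
    for gram in bigrams(value):
        for i in range(k):
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two pseudonyms, computed on their set bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key shared by the data holders
smith = bloom_pseudonym("smith", key)
smyth = bloom_pseudonym("smyth", key)
jones = bloom_pseudonym("jones", key)

print(dice(smith, smyth))  # similar names share most of their bigrams
print(dice(smith, jones))  # dissimilar names share none
```

Because "smith" and "smyth" share four of their six padded bigrams, their pseudonyms overlap heavily, whereas "smith" and "jones" overlap only through chance hash collisions. This preserved overlap is what makes approximate matching possible, and it is also the kind of structure that attacks on Bloom-filter pseudonyms can exploit.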
