Randomized Response and Balanced Bloom Filters for Privacy Preserving Record Linkage

In most European settings, record linkage across different institutions is based on encrypted personal identifiers (such as names, birthdays, or places of birth) to protect privacy. In practice, however, up to 20% of the records may contain errors in identifiers. Exact record linkage on encrypted identifiers therefore usually loses large subsets of the data. Such losses usually imply biased statistical estimates, since in many applications the causes of errors may be correlated with the variables of interest. Over the past 10 years, the field of Privacy Preserving Record Linkage (PPRL) has developed different techniques to link data without revealing the identity of the described entity. However, only a few techniques are suitable for applied research with large databases that include millions of records, as is typical for administrative or medical databases. Bloom filters have proven to be one successful technique for large-scale PPRL applications. Yet Bloom filters have been subject to cryptographic attacks, and previous research has shown that the straightforward application of Bloom filters carries a non-zero re-identification risk. We present new results on recently developed techniques that defy all known attacks on PPRL Bloom filters. These computationally inexpensive algorithms modify personal identifiers by combining different cryptographic techniques. The paper describes the new algorithms and evaluates their performance with respect to precision, recall, and re-identification risk on large databases.
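To make the baseline concrete, the following is a minimal sketch of the basic Bloom filter encoding that error-tolerant PPRL typically starts from: an identifier is split into bigrams, each bigram is hashed into a bit array with keyed hash functions, and two encodings are compared with the Dice coefficient. It is not the hardened method presented in the paper. The function names, the filter size of 1000 bits, the 10 hash functions, the double-hashing construction, and the key are all illustrative assumptions.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a padded, lower-cased identifier into its set of bigrams."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, size=1000, num_hashes=10, key=b"shared-secret"):
    """Return the set of bit positions set to 1 for this identifier.

    Assumption: positions come from double hashing over two keyed
    HMACs (SHA-1 and SHA-256); all parameters are illustrative only.
    """
    positions = set()
    for gram in bigrams(name):
        msg = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(key, msg, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return positions


def dice(a, b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    smith = bloom_encode("Smith")
    # A spelling variant should score well above an unrelated name.
    print("Smith vs Smyth:", round(dice(smith, bloom_encode("Smyth")), 2))
    print("Smith vs Jones:", round(dice(smith, bloom_encode("Jones")), 2))
```

Because similar identifiers share most of their bigrams, their encodings share most of their set bits, which is what lets linkage tolerate typing errors. That same regularity is what frequency-based attacks on basic Bloom filters exploit.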
