A Hybrid Approach to Private Record Linkage

Real-world entities are not always represented by the same set of features in different data sets. Therefore, matching and linking records that correspond to the same real-world entity across these data sets is a challenging task. If the data sets contain private information, the problem becomes even harder due to privacy concerns. Existing solutions to this problem mostly follow one of two approaches: sanitization techniques or cryptographic techniques. The former achieves privacy by perturbing sensitive data, at the expense of matching accuracy. The latter, on the other hand, attains both privacy and high accuracy, but at heavy communication and computation costs. In this paper, we propose a method that combines these two approaches and enables users to trade off between privacy, accuracy, and cost. Experiments conducted on real data sets show that our method has significantly lower costs than cryptographic techniques and yields much more accurate matching results than sanitization techniques, even when the data sets are perturbed extensively.
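
The abstract does not spell out how the two approaches are combined, so the following is only an illustrative sketch of one way such a hybrid could work, not the paper's actual protocol. It assumes integer attribute values, a fixed-width bucketing as a stand-in for sanitization, and a plain comparison as a stand-in for an expensive cryptographic comparison. The names `generalize`, `hybrid_link`, `width`, and `threshold` are hypothetical.

```python
from itertools import product


def generalize(value, width):
    """Sanitization stand-in: replace an integer by the [lo, hi) bucket containing it."""
    lo = (value // width) * width
    return (lo, lo + width)


def hybrid_link(a_values, b_values, width, threshold):
    """Return (matches, costly), where costly counts pairs needing an exact comparison.

    Two values match when |a - b| <= threshold. Pairs whose buckets already
    decide the outcome are labeled from the sanitized data alone.
    """
    matches, costly = [], 0
    for (i, a), (j, b) in product(enumerate(a_values), enumerate(b_values)):
        (alo, ahi), (blo, bhi) = generalize(a, width), generalize(b, width)
        # Smallest and largest possible |a - b| given only the buckets.
        min_dist = max(0, alo - (bhi - 1), blo - (ahi - 1))
        max_dist = max(ahi - 1 - blo, bhi - 1 - alo)
        if min_dist > threshold:
            continue  # certainly a non-match
        if max_dist <= threshold:
            matches.append((i, j))  # certainly a match
            continue
        # Undecided pair: in a real system this step would be a
        # cryptographic comparison; here it is a plain one (assumption).
        costly += 1
        if abs(a - b) <= threshold:
            matches.append((i, j))
    return matches, costly


if __name__ == "__main__":
    print(hybrid_link([23, 47], [25, 90], width=10, threshold=3))
    print(hybrid_link([23, 47], [25, 90], width=1, threshold=3))
```

In this toy setting, wider buckets reveal less about each value but leave more pairs undecided, so more of them fall through to the costly step. This mirrors the privacy, accuracy, and cost trade-off described above.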
