A distributed near-optimal LSH-based framework for privacy-preserving record linkage

In this paper, we present a framework that relies on the Map/Reduce paradigm to distribute computations uniformly across underutilized commodity hardware resources, without imposing extra overhead on the existing infrastructure. The number of distance computations required to compare records is greatly reduced by using Locality-Sensitive Hashing (LSH), which is optimally tuned to avoid highly redundant computations. Experimental results illustrate the effectiveness of our distributed framework in finding the matching record pairs in voluminous data sets.
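The idea of combining LSH blocking with a map step and a reduce step can be illustrated with a minimal single-process sketch. It is not the paper's implementation; it rests on several assumptions that the abstract does not state. Records are assumed to be encoded as fixed-length bit vectors, the hash family is assumed to be bit sampling (Hamming LSH), and the number of hash tables is assumed to follow the standard bound L = ceil(ln(delta) / ln(1 - p^K)), where p is the per-bit collision probability of a matching pair and delta is the tolerated probability of missing that pair. All function names are hypothetical, and an in-memory dictionary stands in for the shuffle phase of a real Map/Reduce cluster.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def num_tables(p, k, delta):
    """Smallest L such that a pair whose bits agree with probability p
    fails to collide in all L tables with probability at most delta.
    (Standard LSH bound; assumed here, not taken from the abstract.)"""
    return math.ceil(math.log(delta) / math.log(1.0 - p ** k))


def make_hash_functions(n_bits, k, n_tables, seed=0):
    """One composite hash per table: k bit positions sampled at random."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n_bits), k)) for _ in range(n_tables)]


def map_record(record_id, bits, hash_functions):
    """Map step: emit one (blocking key, record id) pair per hash table."""
    for table, positions in enumerate(hash_functions):
        key = (table, tuple(bits[i] for i in positions))
        yield key, record_id


def reduce_block(record_ids):
    """Reduce step: every pair of records sharing a key is a candidate."""
    return combinations(sorted(record_ids), 2)


def candidate_pairs(records, hash_functions):
    """Run map, group by key (stand-in for the shuffle), then reduce."""
    buckets = defaultdict(list)
    for record_id, bits in records.items():
        for key, value in map_record(record_id, bits, hash_functions):
            buckets[key].append(value)
    pairs = set()
    for ids in buckets.values():
        pairs.update(reduce_block(ids))
    return pairs


if __name__ == "__main__":
    # Toy parameters chosen only for illustration.
    tables = num_tables(p=0.9, k=10, delta=0.1)
    functions = make_hash_functions(n_bits=16, k=4, n_tables=tables, seed=1)
    records = {"a": [1, 0] * 8, "b": [1, 0] * 8, "c": [0, 1] * 8}
    print(tables, sorted(candidate_pairs(records, functions)))
```

In this sketch only the candidate pairs produced by the reduce step would go on to a full distance computation, which is where the reduction in comparisons comes from. Choosing L from the bound, rather than an arbitrarily large value, is one way to limit how often the same pair is regenerated across tables.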
