Privacy preserving similarity joins using MapReduce

Abstract. Similarity join is an essential operator in data processing, mining, and analysis. However, it is resource-intensive and time-consuming, particularly on big data. Similarity joins must also protect data confidentiality, because joining two files may disclose personal information. Motivated by these two considerations, this paper proposes a MapReduce-based similarity join algorithm with differential privacy (hereafter referred to as PSJoin). The proposed parallel algorithm is designed to answer similarity join queries privately and efficiently. Specifically, PSJoin preserves privacy both during the similarity join process and in the published results. A new private global ordering approach is presented to address potential disclosure during the join process, and a differentially private similarity function is provided for the algorithm. Evaluations on large-scale real-world datasets demonstrate that the method can effectively guarantee privacy with only minimal accuracy loss in similarity queries, while consistently offering good scalability.
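The abstract names two building blocks, a private global ordering and a similarity function, without giving their construction. As a rough illustration only, the Python sketch below shows one way a token ordering could be made noisy with the Laplace mechanism, alongside a plain (non-private) Jaccard similarity for reference. It is not the PSJoin algorithm: the function names, the choice of token-frequency ordering, and the caller-supplied `sensitivity` value are all assumptions made for this example.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_token_order(records, epsilon, sensitivity, rng):
    """Order tokens by Laplace-perturbed document frequency (ascending).

    Assumption: `sensitivity` is an upper bound, chosen by the caller, on how
    much one record can change the frequency vector in L1 norm (for example,
    the maximum number of distinct tokens allowed per record).
    """
    # Count each token at most once per record.
    counts = Counter(tok for rec in records for tok in set(rec))
    scale = sensitivity / epsilon
    noisy = {tok: c + laplace_noise(scale, rng) for tok, c in counts.items()}
    # Break exact ties on the token itself so the output is deterministic
    # for a given set of noisy counts.
    return sorted(noisy, key=lambda tok: (noisy[tok], tok))


def jaccard(a, b):
    """Plain Jaccard similarity of two token collections (no privacy)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    rng = random.Random(7)
    records = [["a", "b"], ["b", "c"], ["b"]]
    print(noisy_token_order(records, epsilon=1.0, sensitivity=2.0, rng=rng))
    print(jaccard(records[0], records[1]))
```

A smaller `epsilon` means more noise, so the ordering drifts further from the true frequency order. How that drift trades off against join accuracy is the kind of question the paper's evaluation addresses.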
