Efficient and Scalable Privacy-Preserving Similar Document Detection

Similar document detection has been well studied for many applications, such as file management systems, plagiarism detection, and double-submission detection. Traditional detection algorithms, however, are challenged when the documents must remain private. Recently, privacy-preserving similar document detection between two parties has gained more attention. However, most existing works focus on computing the similarity between two documents, and they are inefficient, with O(n^2) computational complexity, when securely comparing two sets of n documents. To address this problem, this paper presents a new efficient and scalable privacy-preserving similar document detection protocol based on oblivious multi-garbled Bloom filter intersection and the MinHash algorithm. Experimental evaluation shows that, when processing large document sets, the computational cost of our protocol remains linear as the document sets grow, and that the protocol achieves a large computational performance improvement over other major approaches.
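For readers unfamiliar with MinHash, the following is a minimal, non-private sketch of how MinHash signatures estimate the Jaccard similarity of two documents. It is not the protocol presented in this paper: it omits the oblivious multi-garbled Bloom filter intersection and every privacy mechanism. The function names, the word-level shingle size k, the number of hash functions, and the use of seeded SHA-256 as the hash family are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (k is an assumed parameter)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def seeded_hash(seed, item):
    """Hypothetical hash family: SHA-256 over 'seed:item', truncated to 64 bits."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "privacy preserving similar document detection between two parties"
    doc_b = "privacy preserving similar document detection for large document sets"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated Jaccard similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Each document is reduced to a fixed-length signature, so comparing two documents costs time proportional to the signature length rather than to the document length. The sketch does not by itself remove the quadratic number of pairwise comparisons between two n-document sets.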
