Secure Similar Document Detection with Simhash

Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets without revealing the documents themselves. Such methods apply to scenarios like patent protection or intelligence collaboration, where the contents of each party's documents must be kept secret. Nevertheless, existing protocols for secure similar document detection suffer from high computational and/or communication costs, which render them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method reduces the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.
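To make the reduction to XOR concrete, the following is a minimal plaintext sketch of simhash fingerprinting and the bitwise comparison it enables. It is not the secure protocol described above: both fingerprints are visible to the caller, and no cryptography is involved. The tokenization (lowercased whitespace split), the use of SHA-256 as the per-token hash, the 64-bit fingerprint length, the unweighted token counts, and the threshold of 3 differing bits are all assumptions chosen for illustration, as are the function names.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Plaintext simhash over whitespace-separated tokens (illustrative only).

    Assumes `bits` is a multiple of 8 and at most 256 (the SHA-256 digest size).
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # the majority vote decides the fingerprint bit
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits, computed as the popcount of a XOR b."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, threshold: int = 3) -> bool:
    """Hypothetical decision rule: similar if at most `threshold` bits differ."""
    return hamming_distance(a, b) <= threshold
```

In a privacy-preserving setting, the XOR and bit count in `hamming_distance` are the steps that would have to be evaluated securely, so that neither party learns the other's fingerprint; the sketch only shows what is being computed, not how it is protected.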
