N-Gram Based Secure Similar Document Detection

Secure similar document detection (SSDD) plays an important role in many applications, such as justifying a need-to-know basis and facilitating communication between government agencies. The SSDD problem considers the situation where Alice, who holds a query document, wants to find similar information in Bob's document collection. During this process, the content of the query document is not disclosed to Bob, and Bob's document collection is not disclosed to Alice. Existing SSDD protocols are developed under the vector space model, which has the advantage of identifying globally similar information. To effectively and securely detect similar documents with overlapping text fragments, this paper proposes a novel n-gram based SSDD protocol.
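To make the notion of "overlapping text fragments" concrete, the following is a minimal plaintext sketch of n-gram based similarity. It is not the protocol proposed in the paper and provides no privacy: both documents are visible to the same party. The use of word-level trigrams, the Jaccard coefficient as the overlap measure, and the function names `word_ngrams`, `jaccard`, and `ngram_similarity` are all assumptions made for illustration only.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams in text (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ngram_similarity(doc_a, doc_b, n=3):
    """Plaintext (non-secure) similarity of two documents based on shared n-grams."""
    return jaccard(word_ngrams(doc_a, n), word_ngrams(doc_b, n))


if __name__ == "__main__":
    query = "the quick brown fox jumps"
    candidate = "a quick brown fox jumps high"
    print(ngram_similarity(query, candidate))
```

In an SSDD setting, a computation of this kind would have to be carried out without either party seeing the other's n-gram set, which is the problem a secure protocol addresses.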
