Elxa: Scalable Privacy-Preserving Plagiarism Detection

One of the most challenging issues facing academic conferences and educational institutions today is plagiarism detection. Typically, these entities wish to ensure that the work products submitted to them have not been plagiarized from another source (e.g., authors submitting identical papers to multiple journals). Assembling large centralized databases of documents dramatically improves the effectiveness of plagiarism detection techniques, but introduces a number of privacy and legal issues: all document contents must be completely revealed to the database operator, making it an attractive target for abuse or attack. Moreover, this content aggregation involves the disclosure of potentially sensitive private content, and in some cases this disclosure may be prohibited by law. In this work, we introduce Elxa, the first scalable centralized plagiarism detection system that protects the privacy of the submissions. Elxa incorporates techniques from the current state of the art in plagiarism detection, as evaluated by the information retrieval community. Our system is designed to be operated on existing cloud computing infrastructure, and to provide incentives for the untrusted database operator to maintain the availability of the network. Elxa can be used to detect plagiarism in student work, duplicate paper submissions (and their associated peer reviews), similarities between confidential reports (e.g., malware summaries), or any approximate text reuse within a network of private documents. We implement a prototype using the Hadoop MapReduce framework, and demonstrate that it is feasible to achieve competitive detection effectiveness in the private setting.
