A monolingual approach to detection of text reuse in Russian-English collection

In this paper we develop a method for cross-lingual (Russian-English) text reuse detection. The method follows the monolingual approach: texts are translated into one language, which reduces the task to a monolingual text similarity problem. We split texts into non-overlapping fragments and compare the fragments pairwise using several metrics: BLEU(1-2), METEOR, cosine similarity between bag-of-words representations of the fragments, and cosine similarity between vectors obtained from a trained doc2vec model. We explore how the choice of metric affects the quality of text reuse detection. We assess the quality of the method on a sample of one hundred scientific documents, originally written in Russian and machine-translated into English. Preliminary findings demonstrate the feasibility of the approach.
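As an illustration of the fragment comparison step, the following minimal sketch splits two texts (assumed to be already in English, that is, after translation) into non-overlapping fragments and scores every fragment pair with bag-of-words cosine similarity. The fixed fragment size in tokens, the threshold value, and all function names are assumptions made for this example; the abstract does not specify how fragments are formed or how scores are thresholded.

```python
import math
import re
from collections import Counter


def split_into_fragments(text, size=5):
    """Split text into non-overlapping fragments of at most `size` tokens.

    A fixed token count is an assumption of this sketch.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_reuse(suspicious, source, size=5, threshold=0.8):
    """Return (i, j, score) for fragment pairs scoring at or above `threshold`.

    i indexes fragments of `suspicious`, j indexes fragments of `source`.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    susp_frags = split_into_fragments(suspicious, size)
    src_frags = split_into_fragments(source, size)
    matches = []
    for i, sf in enumerate(susp_frags):
        for j, tf in enumerate(src_frags):
            score = cosine_bow(sf, tf)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


if __name__ == "__main__":
    suspicious = "the method is based on translation of texts"
    source = "something else entirely here now the method is based on"
    for i, j, score in find_reuse(suspicious, source):
        print(f"suspicious fragment {i} ~ source fragment {j}: {score:.2f}")
```

Any of the other metrics named above (BLEU, METEOR, or cosine similarity over doc2vec vectors) could be substituted for `cosine_bow` without changing the pairwise comparison loop.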
