A monolingual approach to detection of text reuse in Russian-English collection

In this paper we develop a method for cross-lingual (Russian-English) text reuse detection. The method follows the monolingual approach: texts are translated into one language, which reduces the task to a monolingual text similarity problem. We split texts into non-overlapping fragments and compare the fragments with each other using several metrics: BLEU (1-2), METEOR, cosine similarity between bag-of-words representations of the fragments, and cosine similarity between vectors obtained from a trained doc2vec model. We explore how the choice of metric affects the quality of text reuse detection. We assess the quality of the method on a sample of one hundred scientific documents, originally written in Russian and machine-translated into English. Preliminary findings demonstrate the feasibility of the approach.
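The fragment comparison step described in the abstract can be illustrated with a minimal sketch. It covers only one of the listed metrics, cosine similarity between bag-of-words vectors, and assumes both texts are already in the same language (the machine translation step is not shown). The function names, the whitespace tokenization, the fragment size, and the threshold are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def split_into_fragments(text, size=5):
    """Split text into consecutive non-overlapping fragments of `size` tokens."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def find_reuse(suspicious, source, size=5, threshold=0.8):
    """Return (i, j, score) for fragment pairs whose similarity reaches the threshold.

    `size` and `threshold` are illustrative defaults, not values from the paper.
    """
    susp_fragments = split_into_fragments(suspicious, size)
    src_fragments = split_into_fragments(source, size)
    matches = []
    for i, f in enumerate(susp_fragments):
        for j, g in enumerate(src_fragments):
            score = cosine_bow(f, g)
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

Swapping `cosine_bow` for another pairwise scoring function (for example a BLEU or METEOR implementation, or cosine similarity over doc2vec vectors) would leave the rest of the loop unchanged, which is one way to compare the effect of the metric choice.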

Alexey Romanov | Rita Kuznetsova | Oleg Bakhteev | Anton Khritankov
