CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity

We present our submitted systems for Track 4 of the Semantic Textual Similarity (STS) task at SemEval-2017. Given a Spanish-English sentence pair, each system must estimate the pair's semantic similarity as a score between 0 and 5. Our submission uses syntax-based, dictionary-based, context-based, and MT-based methods, which we also combine in both unsupervised and supervised ways. Our best run ranked 1st on Track 4a, with a correlation of 83.02% with human annotations.
