TATO: Leveraging on Multiple Strategies for Semantic Textual Similarity

In this paper, we describe the TATO system which participated in the SemEval-2015 Task 2a: “Semantic Textual Similarity (STS) for English”. Our system is trained on published datasets from the previous competitions. Based on some machine learning techniques, it combines multiple similarity measures of varying complexity ranging from simple lexical and syntactic similarity measures to complex semantic similarity ones to compute semantic textual similarity. Our final model consists of a simple linear combination of about 30 main features out of a numerous number of features experimented. The results are promising, with Pearson’s coefficients on each individual dataset ranging from 0.6796 to 0.8167 and an overall weighted mean score of 0.7422, well above the task baseline system.

[1]  Ido Dagan,et al.  The Third PASCAL Recognizing Textual Entailment Challenge , 2007, ACL-PASCAL@ACL.

[2]  Johan Bos,et al.  Recognising Textual Entailment with Logical Inference , 2005, HLT.

[3]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[4]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[5]  Nitin Madnani,et al.  TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate , 2009, Machine Translation.

[6]  Hwee Tou Ng,et al.  MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation , 2008, ACL.

[7]  Peter W. Foltz,et al.  The Measurement of Textual Coherence with Latent Semantic Analysis. , 1998 .

[8]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[9]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[10]  Ion Androutsopoulos,et al.  Learning Textual Entailment using SVMs and String Similarity Measures , 2007, ACL-PASCAL@ACL.

[11]  Iryna Gurevych,et al.  UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures , 2012, *SEMEVAL.

[12]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[13]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[14]  Wael Hassan Gomaa,et al.  A Survey of Text Similarity Approaches , 2013 .

[15]  Zuhair Bandar,et al.  Sentence similarity based on semantic nets and corpus statistics , 2006, IEEE Transactions on Knowledge and Data Engineering.

[16]  Carlo Strapparava,et al.  Corpus-based and Knowledge-based Measures of Text Semantic Similarity , 2006, AAAI.

[17]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[18]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[19]  F. A B I O M A S S I M O Z A N Z O T T O,et al.  A machine learning approach to textual entailment recognition , 2009 .

[20]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[21]  Prodromos Malakasiotis,et al.  Paraphrase Recognition Using Machine Learning to Combine Similarity Measures , 2009, ACL.

[22]  Alon Lavie,et al.  Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level , 2010, NAACL.

[23]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[24]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[25]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[26]  Hwee Tou Ng,et al.  TESLA: Translation Evaluation of Sentences with Linear-Programming-Based Analysis , 2010, WMT@ACL.

[27]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[28]  Chris Quirk,et al.  Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources , 2004, COLING.

[29]  Fabio Rinaldi,et al.  Exploiting Paraphrases in a Question Answering System , 2003, IWP@ACL.

[30]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[31]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.