ECNU at SemEval-2016 Task 1: Leveraging Word Embedding From Macro and Micro Views to Boost Performance for Semantic Textual Similarity

This paper presents our submissions for semantic textual similarity task in SemEval 2016. Based on several traditional features (i.e., string-based, corpus-based, machine translation similarity and alignment metrics), we leverage word embedding from macro (i.e., first get representation of sentence, then measure the similarity of sentence pair) and micro views (i.e., measure the similarity of word pairs separately) to boost performance. Due to the various domains of training data and test data, we adopt three different strategies: 1) U-SEVEN: an unsupervised model, which utilizes seven straight-forward metrics; 2) S1-All: using all available datasets; 3) S2: selecting the most similar training sets for each test set. Results on test sets show that the unified supervised model (i.e., S1-All) achieves the best averaged performance with a mean correlation of 75.07%.

[1]  Gilles Louppe,et al.  Independent consultant , 2013 .

[2]  Chris Callison-Burch,et al.  PPDB: The Paraphrase Database , 2013, NAACL.

[3]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[4]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[5]  Jan Snajder,et al.  TakeLab: Systems for Measuring Semantic Text Similarity , 2012, *SEMEVAL.

[6]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[7]  Claire Cardie,et al.  SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability , 2015, *SEMEVAL.

[8]  Mirella Lapata,et al.  Vector-based Models of Semantic Composition , 2008, ACL.

[9]  Steven Bethard,et al.  DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition , 2015, *SEMEVAL.

[10]  Man Lan,et al.  ECNU: One Stone Two Birds: Ensemble of Heterogenous Measures for Semantic Relatedness and Textual Entailment , 2014, *SEMEVAL.

[11]  Alexander F. Gelbukh,et al.  Soft Cardinality: A Parameterized Similarity Function for Text Comparison , 2012, *SEMEVAL.

[12]  Iryna Gurevych,et al.  UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures , 2012, *SEMEVAL.

[13]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[14]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[15]  Eneko Agirre,et al.  Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation , 2015, ArXiv.