Semantic textual similarity between sentences using bilingual word semantics

Semantic textual similarity between sentences is indispensable for many information retrieval tasks. Traditional lexical similarity measures cannot compute similarity beyond a trivial level. Moreover, they capture only textual similarity, not semantic similarity. In this paper, we propose a method for semantic textual similarity that leverages bilingual word-level semantics to compute the semantic similarity between sentences. To capture word-level semantics, we employ distributed representations of words in two different languages. For the same purpose, we also use a similarity function based on the concept-to-concept relationships corresponding to the words. We introduce multiple new semantic similarity measures based on word-embedding models trained on two different corpora in two different languages. In addition, we introduce a further semantic similarity measure based on word sense comparison. The similarity score between two sentences is then computed by applying a linear ranking approach to all proposed measures, with the importance score of each measure estimated by a supervised feature selection technique. We conducted experiments on the SemEval Semantic Textual Similarity (STS-2017) test collections. The experimental results demonstrate that our method is effective for measuring semantic textual similarity and outperforms some known related methods.
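As a rough illustration of the general idea only, the following Python sketch computes one embedding-based similarity per language and combines the resulting measures linearly. It is not the paper's implementation: the toy vectors, the choice of Spanish as the second language, the pre-translated token lists, the averaged-vector cosine measure, the weight values, and all function names are assumptions made for this example. In the described method the importance scores come from a supervised feature selection technique, whereas here they are simply given by hand.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (None if none do)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_similarity(tokens1, tokens2, embeddings):
    """One possible sentence-level measure: cosine of averaged word vectors."""
    return cosine(mean_vector(tokens1, embeddings), mean_vector(tokens2, embeddings))

def combined_score(measures, weights):
    """Weighted linear combination of measures, normalized by the total weight.
    The normalization is an assumption of this sketch."""
    total = sum(weights)
    if not total:
        return 0.0
    return sum(m * w for m, w in zip(measures, weights)) / total

# Hypothetical toy embeddings standing in for models trained on two corpora.
EN_VECTORS = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "sleeps": [0.0, 1.0]}
ES_VECTORS = {"gato": [0.9, 0.1], "perro": [0.7, 0.7], "duerme": [0.1, 0.9]}

# A sentence pair in English and its (assumed, pre-computed) Spanish translation.
en_pair = (["cat", "sleeps"], ["dog", "sleeps"])
es_pair = (["gato", "duerme"], ["perro", "duerme"])

measures = [
    embedding_similarity(en_pair[0], en_pair[1], EN_VECTORS),
    embedding_similarity(es_pair[0], es_pair[1], ES_VECTORS),
]
weights = [0.6, 0.4]  # hypothetical importance scores, not learned here
print(combined_score(measures, weights))
```

Normalizing by the total weight keeps the combined score in the same range as the individual measures, which makes it easy to rescale to a gold-standard scale afterwards.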
