Semantic Textual Similarity in Bengali Text

Measuring textual similarity is indispensable in many information retrieval applications. Researchers have proposed numerous measures of semantic similarity between texts, in both monolingual and multilingual settings, but methods for Bengali text segments remain scarce. In this paper, we propose an approach to estimate the semantic similarity between Bengali text segments. The similarity score is computed by an algorithm that draws on word-level semantics from a word-embedding model pre-trained on Bengali Wikipedia text. To test the performance of our method, we conducted experiments on a semantic textual similarity dataset for Bengali, which we prepared following the approach used by SemEval for STS 2017. The experimental results, reported as Pearson correlation coefficients, indicate that our method achieves state-of-the-art performance for semantic textual similarity in Bengali texts.
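The abstract does not spell out the similarity algorithm, so the following is only a minimal sketch of the general idea: score a pair of texts from word vectors, then compare system scores with human judgments using the Pearson correlation coefficient. Everything specific here is an assumption made for illustration: the toy 3-dimensional vectors in `TOY_EMBEDDINGS` stand in for a real pre-trained Bengali model, whitespace tokenization replaces proper Bengali preprocessing, and averaging word vectors followed by cosine similarity is a common baseline, not necessarily the method proposed in the paper.

```python
import math

# Hypothetical toy vectors; a real system would load vectors
# pre-trained on Bengali Wikipedia instead.
TOY_EMBEDDINGS = {
    "আমি": [0.9, 0.1, 0.0],   # "I"
    "ভাত": [0.1, 0.8, 0.2],   # "rice"
    "রুটি": [0.2, 0.7, 0.3],  # "bread"
    "খাই": [0.0, 0.6, 0.5],   # "eat"
    "বই": [0.1, 0.0, 0.9],    # "book"
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, embeddings):
    """Average the vectors of the in-vocabulary tokens; None if there are none."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def text_similarity(text_a, text_b, embeddings):
    """Baseline similarity: cosine of the averaged word vectors of each text."""
    vec_a = sentence_vector(text_a.split(), embeddings)
    vec_b = sentence_vector(text_b.split(), embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("Pearson correlation is undefined for constant scores")
    return cov / (std_x * std_y)


if __name__ == "__main__":
    pairs = [
        ("আমি ভাত খাই", "আমি রুটি খাই"),
        ("আমি ভাত খাই", "বই"),
    ]
    for a, b in pairs:
        print(a, "|", b, "->", round(text_similarity(a, b, TOY_EMBEDDINGS), 3))
```

In an evaluation along the lines the abstract describes, `pearson` would be called with the list of system scores and the list of human-annotated scores for the same text pairs; a value closer to 1 means the system ranks and spaces the pairs more like the annotators do.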
