Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages

Semantic textual similarity algorithms are essential to several natural language processing tasks, such as document clustering and text summarization. Many shared tasks on this subject have been held over the last few years, but they generally focused on a single domain and/or language. Siamese Neural Networks (SNNs) are well known for their ability to compute similarity while requiring less training data. We proposed an SNN architecture that incorporates language-independent features, aiming to compute short text similarity across multiple languages and domains. We explored three different corpora from shared tasks: ASSIN 1 and ASSIN 2 (Portuguese journalistic texts) and N2C2 (English clinical texts). We adapted the SNN proposed by Mueller and Thyagarajan (2016) in two ways: (i) we replaced the sigmoid activation functions with ReLU, and (ii) we extended the architecture to accept three new lexical features and added an embedding layer to obtain the values of the pre-trained word embeddings. Evaluation used the Pearson correlation (PC) and the Mean Squared Error (MSE) between the models’ predicted values and the corpora’s gold standard. Our approach achieved better results than the baseline in both languages and domains.
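As an illustration of the two evaluation measures named above, the following is a minimal sketch of how Pearson correlation and MSE could be computed between predicted and gold similarity scores. The function names and the sample scores are hypothetical and are not taken from the paper or its corpora.

```python
import math


def pearson_correlation(predicted, gold):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    # Covariance numerator and the two standard-deviation numerators;
    # the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    dev_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    return cov / (dev_p * dev_g)


def mean_squared_error(predicted, gold):
    """Average of the squared differences between predictions and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical similarity scores on a 1-5 scale (illustrative only).
    gold_scores = [1.0, 2.5, 4.0, 5.0]
    model_scores = [1.5, 2.0, 4.5, 4.5]
    print("PC: ", pearson_correlation(model_scores, gold_scores))
    print("MSE:", mean_squared_error(model_scores, gold_scores))
```

A higher PC and a lower MSE both indicate predictions closer to the gold standard, which is why the two measures are usually reported together.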

[1] Denis Peskov, et al. UMDeep at SemEval-2017 Task 1: End-to-End Shared Weight LSTM Model for Semantic Textual Similarity, 2017, SemEval@ACL.

[2] Luciano Barbosa, et al. Blue Man Group no ASSIN: Usando Representações Distribuídas para Similaridade Semântica e Inferência Textual, 2016, Linguamática.

[3] Jonas Mueller, et al. Siamese Recurrent Architectures for Learning Sentence Similarity, 2016, AAAI.

[4] Yann LeCun, et al. Learning a similarity metric discriminatively, with application to face verification, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5] Hugo Gonçalo Oliveira, et al. ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese, 2018, SLATE.

[6] Yann LeCun, et al. Signature Verification Using A "Siamese" Time Delay Neural Network, 1993, Int. J. Pattern Recognit. Artif. Intell.

[7] Ruslan Mitkov, et al. Semantic Textual Similarity with Siamese Neural Networks, 2019, RANLP.

[8] Eneko Agirre, et al. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity, 2012, *SEMEVAL.

[9] Sandra M. Aluísio, et al. Visão Geral da Avaliação de Similaridade Semântica e Inferência Textual, 2016, Linguamática.

[10] Eneko Agirre, et al. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation, 2017, *SEMEVAL.

[11] Marco Marelli, et al. SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment, 2016, Language Resources and Evaluation.

[12] Maarten Versteegh, et al. Learning Text Similarity with Siamese Recurrent Networks, 2016, Rep4NLP@ACL.

[13] Claire Cardie, et al. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity, 2014, *SEMEVAL.

[14] Eneko Agirre, et al. *SEM 2013 shared task: Semantic Textual Similarity, 2013, *SEMEVAL.

[15] Nathan Hartmann, et al. Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks, 2017, STIL.

[16] Claire Cardie, et al. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability, 2015, *SEMEVAL.

[17] Eneko Agirre, et al. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation, 2016, *SEMEVAL.

[18] Nathan Siegle Hartmann. Solo Queue at ASSIN: Combinando Abordagens Tradicionais e Emergentes, 2016, Linguamática.

[19] Sadid A. Hasan, et al. Learning Portuguese Clinical Word Embeddings: A Multi-Specialty and Multi-Institutional Corpus of Clinical Narratives Supporting a Downstream Biomedical Task, 2019, MedInfo.