Empirical Linguistic Study of Sentence Embeddings

This study investigates whether linguistic information is retained in vector representations of sentences. We introduce a method for analysing the content of sentence embeddings based on universal probing tasks, along with classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.
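
To make the probing setup concrete, below is a minimal sketch of a probing experiment in Python. This is an illustration under stated assumptions, not the authors' implementation: the embedding matrix X and the labels y are synthetic stand-ins for real sentence vectors (e.g., from LASER or BERT) and gold linguistic annotations, and scikit-learn's logistic regression plays the role of the probing classifier.

```python
# Minimal probing-task sketch (an assumption-laden illustration, not the
# paper's exact setup). A simple classifier is trained to predict a
# linguistic property of a sentence from its fixed, precomputed embedding;
# high probe accuracy suggests the property is recoverable from the vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real data: in practice X would hold sentence embeddings
# (e.g., LASER vectors) and y the gold linguistic labels for a probing
# task (e.g., a binary property annotated from a treebank).
X = rng.normal(size=(1000, 1024))   # 1000 sentences, 1024-dim embeddings
y = rng.integers(0, 2, size=1000)   # one binary linguistic label per sentence

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # the probing classifier itself
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

With the random stand-in data above, the probe should score near the 50% chance level; in a real experiment, a probing result is informative only insofar as accuracy clearly exceeds such a random or majority-class baseline.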
