Empirical Linguistic Study of Sentence Embeddings

This study investigates whether linguistic information is retained in vector representations of sentences. We introduce a method for analysing the content of sentence embeddings based on universal probing tasks, along with classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.
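
To make the probing setup concrete, below is a minimal sketch of a probing experiment in Python. This is an illustration under stated assumptions, not the authors' implementation: the embedding matrix X and the labels y are synthetic stand-ins for real sentence vectors (e.g., from LASER or BERT) and gold linguistic annotations, and scikit-learn's logistic regression plays the role of the probing classifier.

```python
# Minimal probing-task sketch (an assumption-laden illustration, not the
# paper's exact setup). A simple classifier is trained to predict a
# linguistic property of a sentence from its fixed, precomputed embedding;
# high probe accuracy suggests the property is recoverable from the vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real data: in practice X would hold sentence embeddings
# (e.g., LASER vectors) and y the gold linguistic labels for a probing
# task (e.g., a binary property annotated from a treebank).
X = rng.normal(size=(1000, 1024))   # 1000 sentences, 1024-dim embeddings
y = rng.integers(0, 2, size=1000)   # one binary linguistic label per sentence

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # the probing classifier itself
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

With the random stand-in data above, the probe should score near the 50% chance level; in a real experiment, a probing result is informative only insofar as accuracy clearly exceeds such a random or majority-class baseline.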
