Embeddings in Natural Language Processing

Embeddings have been one of the most important topics of interest in Natural Language Processing (NLP) for the past decade. Representing knowledge through a low-dimensional vector which is easily integrable in modern machine learning models has played a central role in the development of the field. Embedding techniques initially focused on words, but attention soon started to shift to other forms. This tutorial will provide a high-level synthesis of the main embedding techniques in NLP, in the broad sense. We will start with conventional word embeddings (e.g., Word2Vec and GloVe) and then move to other types of embeddings, such as sense-specific and graph alternatives. We will conclude with an overview of the trending contextualized representations (e.g., ELMo and BERT) and explain their potential and impact in NLP.

1 Description

In this tutorial we will start by providing a historical overview of word-level vector space models, and word embeddings in particular. Word embeddings (e.g., Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) or FastText (Bojanowski et al., 2017)) have proven to be powerful keepers of prior knowledge to be integrated into downstream NLP applications. However, despite their flexibility and success in capturing semantic properties of words, the effectiveness of word embeddings is generally hampered by an important limitation, known as the meaning conflation deficiency: the inability to discriminate among different meanings of a word. A word can have one meaning (monosemous) or multiple meanings (ambiguous). For instance, the noun mouse can refer to two different meanings depending on the context: an animal or a computer device. Hence, mouse is said to be ambiguous. In fact, according to the Principle of Economical Versatility of Words (Zipf, 1949), frequent words tend to have more senses.
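To make the idea of word-level vector space models concrete, here is a minimal, hypothetical sketch of count-based word vectors: a co-occurrence matrix over a toy corpus, factorized with truncated SVD. This is the idea underlying LSA (and loosely related to GloVe's global-count objective), not the actual Word2Vec or GloVe training procedure; the corpus, window size, and dimensionality are illustrative choices.

```python
import numpy as np

# Toy corpus; real embeddings are trained on billions of tokens.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Low-rank factorization yields dense word vectors: words that occur
# in similar contexts (here, "cat" and "dog") end up close together.
u, s, _ = np.linalg.svd(counts)
dim = 2  # toy dimensionality; 100-300 is typical in practice
embeddings = u[:, :dim] * s[:dim]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Distributionally similar words such as "cat" and "dog" receive nearly identical co-occurrence rows here, so their low-dimensional vectors end up with high cosine similarity, which is the core intuition behind all the models discussed above.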
Moreover, this meaning conflation can have additional negative impacts on accurate semantic modeling, e.g., semantically unrelated words that are similar to different senses of a word are pulled towards each other in the semantic space (Neelakantan et al., 2014; Pilehvar and Collier, 2016). In our example, the two semantically unrelated words rat and screen are pulled towards each other in the semantic space because of their similarities to two different senses of mouse (see Figure 1). This, in turn, contributes to the violation of the triangle inequality in Euclidean spaces (Tversky and Gati, 1982; Neelakantan et al., 2014). Accurately capturing the meaning of words (both ambiguous and unambiguous) plays a crucial role in the language understanding of NLP systems.

To deal with the meaning conflation deficiency, this tutorial covers approaches that have attempted to model individual word senses (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Rothe and Schütze, 2015; Li and Jurafsky, 2015; Pilehvar and Collier, 2016; Mancini et al., 2017). Sense representation techniques, however, suffer from limitations which hinder their effective application in downstream NLP tasks: they either need vast amounts of training data to obtain reliable representations or require an additional sense disambiguation step on the input text to make them integrable into NLP systems. Such data is highly expensive to obtain in practice, which causes the so-called knowledge-acquisition bottleneck (Gale et al., 1992). As a practical way to deal with the knowledge-acquisition bottleneck, an emerging branch of research has focused on directly integrating unsupervised embeddings into downstream models. Instead of learning a fixed number of senses per word, contextualized word embeddings learn “senses” dynamically, i.e., their representations change depending on the context in which a word appears.
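The pull that conflation exerts on rat and screen can be sketched numerically. In this hypothetical two-dimensional space, the axes stand in for "animal-like" and "device-like" contexts and all vectors are made up for illustration: a single conflated mouse vector (the midpoint of its two sense vectors) ends up close to both rat and screen, even though rat and screen remain dissimilar to each other.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Made-up 2-d vectors: axis 0 ~ "animal" contexts, axis 1 ~ "device" contexts.
rat    = [1.0, 0.1]   # almost purely animal-like contexts
screen = [0.1, 1.0]   # almost purely device-like contexts

# Separate sense vectors keep the two meanings apart, but a single
# conflated "mouse" vector collapses them into their midpoint.
mouse_animal = [1.0, 0.0]
mouse_device = [0.0, 1.0]
mouse = [(a + d) / 2 for a, d in zip(mouse_animal, mouse_device)]

print(cosine(rat, mouse))     # high: rat is pulled towards mouse
print(cosine(screen, mouse))  # high: screen is pulled towards mouse
print(cosine(rat, screen))    # low: yet rat and screen are unrelated
```

Because mouse is simultaneously similar to two words that are dissimilar to each other, similarity-based reasoning over the conflated space behaves as if the triangle inequality were violated, which is exactly the effect discussed above.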

[1] Matthew E. Peters et al. Dissecting Contextual Word Embeddings: Architecture and Representation, 2018, EMNLP.

[2] Sascha Rothe and Hinrich Schütze. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes, 2015, ACL.

[3] Peter D. Turney and Patrick Pantel. From Frequency to Meaning: Vector Space Models of Semantics, 2010, J. Artif. Intell. Res.

[4] Joseph Reisinger and Raymond J. Mooney. Multi-Prototype Vector-Space Models of Word Meaning, 2010, NAACL.

[5] Mohammad Taher Pilehvar and Nigel Collier. De-Conflated Semantic Representations, 2016, EMNLP.

[6] Mohammad Taher Pilehvar and José Camacho-Collados. Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning, 2020.

[7] George Kingsley Zipf. Human Behavior and the Principle of Least Effort, 1949.

[8] Jiwei Li and Daniel Jurafsky. Do Multi-Sense Embeddings Improve Natural Language Understanding?, 2015, EMNLP.

[9] Hinrich Schütze. Dimensions of Meaning, 1992, Supercomputing '92.

[10] Eric H. Huang et al. Improving Word Representations via Global Context and Multiple Word Prototypes, 2012, ACL.

[11] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[12] Jeffrey Pennington et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[13] Matthew E. Peters et al. Deep Contextualized Word Representations, 2018, NAACL.

[14] Piotr Bojanowski et al. Enriching Word Vectors with Subword Information, 2017, TACL.

[15] Massimiliano Mancini et al. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training, 2017, CoNLL.

[16] José Camacho-Collados and Mohammad Taher Pilehvar. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning, 2018, J. Artif. Intell. Res.

[17] Ashish Vaswani et al. Attention Is All You Need, 2017, NIPS.

[18] Oren Melamud et al. context2vec: Learning Generic Context Embedding with Bidirectional LSTM, 2016, CoNLL.

[19] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[20] Amos Tversky and Itamar Gati. Similarity, Separability, and the Triangle Inequality, 1982, Psychological Review.

[21] Arvind Neelakantan et al. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space, 2014, EMNLP.