Embeddings have been one of the most important topics of interest in Natural Language Processing (NLP) over the past decade. Representing knowledge through a low-dimensional vector that is easily integrable in modern machine learning models has played a central role in the development of the field. Embedding techniques initially focused on words, but attention soon started to shift to other forms. This tutorial will provide a high-level synthesis of the main embedding techniques in NLP, in the broad sense. We will start with conventional word embeddings (e.g., Word2Vec and GloVe) and then move to other types of embeddings, such as sense-specific and graph alternatives. We will conclude with an overview of the trending contextualized representations (e.g., ELMo and BERT) and explain their potential and impact on NLP.

1 Description

In this tutorial we will start by providing a historical overview of word-level vector space models, and word embeddings in particular. Word embeddings (e.g., Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) or FastText (Bojanowski et al., 2017)) have proven to be powerful carriers of prior knowledge to be integrated into downstream Natural Language Processing (NLP) applications. However, despite their flexibility and success in capturing semantic properties of words, the effectiveness of word embeddings is generally hampered by an important limitation, known as the meaning conflation deficiency: the inability to discriminate among different meanings of a word.

A word can have one meaning (monosemous) or multiple meanings (ambiguous). For instance, the noun mouse can refer to two different meanings depending on the context: an animal or a computer device. Hence, mouse is said to be ambiguous. In fact, according to the Principle of Economical Versatility of Words (Zipf, 1949), frequent words tend to have more senses. Moreover, this meaning conflation can have additional negative impacts on accurate semantic modeling, e.g., semantically unrelated words that are similar to different senses of a word are pulled towards each other in the semantic space (Neelakantan et al., 2014; Pilehvar and Collier, 2016). In our example, the two semantically unrelated words rat and screen are pulled towards each other in the semantic space because of their similarities to two different senses of mouse (see Figure 1). This, in turn, contributes to the violation of the triangle inequality in Euclidean spaces (Tversky and Gati, 1982; Neelakantan et al., 2014). Accurately capturing the meaning of words (both ambiguous and unambiguous) plays a crucial role in the language understanding of NLP systems.

In order to deal with the meaning conflation deficiency, this tutorial covers approaches that have attempted to model individual word senses (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Rothe and Schütze, 2015; Li and Jurafsky, 2015; Pilehvar and Collier, 2016; Mancini et al., 2017). Sense representation techniques, however, suffer from limitations which hinder their effective application in downstream NLP tasks: they either need vast amounts of training data to obtain reliable representations or require an additional sense disambiguation step on the input text before they can be integrated into NLP systems. Such data is highly expensive to obtain in practice, which causes the so-called knowledge-acquisition bottleneck (Gale et al., 1992).
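Returning to the mouse example, the effect can be made concrete with a small numeric sketch. This is a minimal illustration using hypothetical 2-D sense vectors (not trained embeddings): because both rat and screen end up close to the single conflated mouse vector, the triangle inequality bounds how far apart the two unrelated words can be in the space.

```python
import numpy as np

# Hypothetical 2-D sense vectors, chosen for illustration only
# (these are not trained embeddings).
mouse_animal = np.array([1.0, 0.0])    # "mouse" as a rodent
mouse_device = np.array([0.0, 1.0])    # "mouse" as a pointing device
rat = np.array([0.95, 0.05])           # related to the animal sense
screen = np.array([0.05, 0.95])        # related to the device sense

# A single static embedding conflates both senses into one point,
# roughly an average of the sense vectors.
mouse = (mouse_animal + mouse_device) / 2.0

def dist(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(a - b)

# Both 'rat' and 'screen' lie close to the conflated 'mouse' vector...
print("d(rat, mouse)    =", round(dist(rat, mouse), 3))
print("d(screen, mouse) =", round(dist(screen, mouse), 3))

# ...so the triangle inequality caps their mutual distance:
# d(rat, screen) <= d(rat, mouse) + d(mouse, screen),
# pulling two semantically unrelated words towards each other.
print("d(rat, screen)   =", round(dist(rat, screen), 3))
print("upper bound      =", round(dist(rat, mouse) + dist(mouse, screen), 3))
```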
As a practical way to deal with the knowledge-acquisition bottleneck, an emerging branch of research has focused on directly integrating unsupervised embeddings into downstream models. Instead of learning a fixed number of senses per word, contextualized word embeddings learn “senses” dynamically: the representation of a word changes depending on the context in which it occurs, so that each occurrence of an ambiguous word such as mouse receives its own context-specific vector (e.g., as in ELMo and BERT).
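As an illustrative sketch of this dynamic behaviour (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the tutorial itself), the snippet below extracts the contextual vector of mouse from two sentences and compares them; the same word type receives a different representation in each context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pretrained contextualized model (assumption: bert-base-uncased).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def mouse_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token 'mouse' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    # Locate the position of 'mouse' (assumed to be a single WordPiece here).
    mouse_id = tokenizer.convert_tokens_to_ids("mouse")
    position = (inputs["input_ids"][0] == mouse_id).nonzero()[0].item()
    return hidden[position]

v_animal = mouse_vector("The cat chased the mouse across the field.")
v_device = mouse_vector("She clicked the mouse to open the file.")

# Unlike a static word embedding, which assigns the identical vector to both
# occurrences, the two contextual vectors differ.
similarity = torch.nn.functional.cosine_similarity(v_animal, v_device, dim=0)
print(f"cosine(mouse@animal, mouse@device) = {similarity.item():.3f}")
```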
References

[1] Luke S. Zettlemoyer, et al. Dissecting Contextual Word Embeddings: Architecture and Representation. EMNLP, 2018.
[2] Hinrich Schütze, et al. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. ACL, 2015.
[3] Patrick Pantel, et al. From Frequency to Meaning: Vector Space Models of Semantics. J. Artif. Intell. Res., 2010.
[4] Raymond J. Mooney, et al. Multi-Prototype Vector-Space Models of Word Meaning. NAACL, 2010.
[5] Nigel Collier, et al. De-Conflated Semantic Representations. EMNLP, 2016.
[6] Mohammad Taher Pilehvar, et al. Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning. 2020.
[7] George Kingsley Zipf, et al. Human Behavior and the Principle of Least Effort. 1949.
[8] Daniel Jurafsky, et al. Do Multi-Sense Embeddings Improve Natural Language Understanding? EMNLP, 2015.
[9] H. Schütze, et al. Dimensions of Meaning. Supercomputing '92, 1992.
[10] Andrew Y. Ng, et al. Improving Word Representations via Global Context and Multiple Word Prototypes. ACL, 2012.
[11] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[12] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation. EMNLP, 2014.
[13] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations. NAACL, 2018.
[14] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information. TACL, 2016.
[15] Ignacio Iacobacci, et al. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. CoNLL, 2016.
[16] José Camacho-Collados, et al. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. J. Artif. Intell. Res., 2018.
[17] Lukasz Kaiser, et al. Attention is All you Need. NIPS, 2017.
[18] Ido Dagan, et al. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. CoNLL, 2016.
[19] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.
[20] A. Tversky, et al. Similarity, Separability, and the Triangle Inequality. Psychological Review, 1982.
[21] Andrew McCallum, et al. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. EMNLP, 2014.