Learning Semantic Representations with Hidden Markov Topics Models

Mark Andrews (m.andrews@ucl.ac.uk)
Gabriella Vigliocco (g.vigliocco@ucl.ac.uk)
Cognitive, Perceptual and Brain Sciences, University College London,
26 Bedford Way, London, WC1H 0AP, United Kingdom

Abstract

In this paper, we describe a model that learns semantic representations from the distributional statistics of language. This model, however, goes beyond the common bag-of-words paradigm, and infers semantic representations by taking into account the inherent sequential nature of linguistic data. The model we describe, which we refer to as a Hidden Markov Topics model, is a natural extension of the current state of the art in Bayesian bag-of-words models, i.e. the Topics model of Griffiths, Steyvers, and Tenenbaum (2007), preserving its strengths while extending its scope to incorporate more fine-grained linguistic information.

Introduction

How word meanings are learned is a foundational problem in the study of human language use. Within cognitive science, a promising recent approach to this problem has been the study of how the meanings of words can be learned from their statistical distribution across the language. This approach is motivated by the so-called distributional hypothesis, originally due to Harris (1954) and Firth (1957), which proposes that the meaning of a word is given by the linguistic contexts in which it occurs. Numerous large-scale computational implementations of this approach, including the work of Schütze (1992), the HAL model (Lund, Burgess, & Atchley, 1995), the LSA model (Landauer & Dumais, 1997) and, most recently, the Topics model (Griffiths et al., 2007), have successfully demonstrated that the meanings of words can, at least in part, be derived from their statistical distribution in language.

Important as these computational models have been, one of their widely shared practices has been to treat the linguistic contexts in which a word occurs as unordered sets of words. In other words, the linguistic context of any given word is defined by which words co-occur with it and with what frequency, while all fine-grained sequential and syntactic information is disregarded. By disregarding these types of data, these so-called bag-of-words models drastically restrict the information from which word meanings can be learned. All languages have strong syntactic-semantic correlations. The sequential order in which words occur, the argument structure, and general syntactic relationships within sentences all provide vital information about the possible meaning of words. This information is unavailable in bag-of-words models, and consequently the extent to which they can extract semantic information from text, or adequately model human semantic learning, is limited.

In this paper, we describe a distributional model that goes beyond the bag-of-words paradigm. This model is a natural extension of the current state of the art in probabilistic bag-of-words models, namely the Topics model described in Griffiths et al. (2007) and elsewhere. The model we propose is a seamless continuation of the Topics model, preserving its strengths, namely its thoroughly unsupervised learning and its hierarchical Bayesian nature, while extending its scope to incorporate more fine-grained sequential and syntactic data.

The Topics Model
The standard Topics model, as described in Griffiths and Steyvers (2002, 2003) and Griffiths et al. (2007), is a probabilistic generative model for texts, and is based on the Latent Dirichlet Allocation (LDA) model of Blei, Ng, and Jordan (2003). It stipulates that each word in a corpus of texts is drawn from one of K latent distributions φ_1 . . . φ_k . . . φ_K, collectively denoted φ, with each φ_k being a probability distribution over the V word-types in a fixed vocabulary. These distributions are the so-called topics that give the model its name. Some examples, learned by a Topics model described in Andrews, Vigliocco, and Vinson (in press), are given in the table below (each column gives the 7 most probable word types in each topic).

theatre    music     league      prison      air
stage      band      cup         years       aircraft
arts       rock      season      sentence    flying
play       song      team        jail        flight
dance      record    game        home        plane
opera      pop       match       prisoner    airport
cast       dance     division    serving     pilot

As is evident from this table, each topic is a cluster of related terms that corresponds to a coherent semantic theme, or subject-matter. As such, the topic distributions correspond to the semantic knowledge learned by the model, and the semantic representation of each word in the vocabulary is given by a distribution over them.
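To make these generative assumptions concrete, the following is a minimal sketch of the bag-of-words generative process underlying the Topics model, written in Python and following the LDA formulation of Blei et al. (2003). The values of K, V, the Dirichlet hyperparameters, and the toy corpus sizes are illustrative assumptions rather than values used in the model described here, and the sketch runs only in the generative direction, whereas in practice the topics φ are inferred from an observed corpus.

import numpy as np

rng = np.random.default_rng(0)

K = 5         # number of latent topics (illustrative)
V = 1000      # vocabulary size (illustrative)
n_docs = 10   # toy corpus size
doc_len = 50  # words per document (fixed here for simplicity)

# Each topic phi_k is a distribution over the V word types,
# drawn here from a symmetric Dirichlet prior.
phi = rng.dirichlet(np.full(V, 0.01), size=K)   # shape (K, V)

corpus = []
for _ in range(n_docs):
    # Each document has its own mixing distribution over the K topics.
    theta = rng.dirichlet(np.full(K, 0.1))      # shape (K,)
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)              # choose a topic for this word
        w = rng.choice(V, p=phi[z])             # draw the word from that topic
        doc.append(w)
    corpus.append(doc)

# Each topic can be summarised by its most probable word types;
# here we list the top-7 word indices per topic.
top_words = np.argsort(phi, axis=1)[:, ::-1][:, :7]
print(top_words)

The final lines mirror how a learned topic can be summarised by its most probable word types, as in the table above, with word indices standing in for the actual vocabulary items.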