A Bernoulli mixture model for word categorisation

The problem of word categorisation is formulated as one of unsupervised mixture modelling in which Bernoulli distributions capture contextual information. We detail how the free parameters of the mixture model can be estimated through an EM procedure. A deterministic word-to-class mapping is then derived from this model using a hierarchical clustering algorithm.

Categorisation plays an important role in language modelling. It lets us reduce the number of free parameters to be estimated and allows us to enlarge the vocabulary of the task easily, without retraining. In this paper, we address the word-class selection problem with an unsupervised method that combines contextual information about the words in the training set with a suitable distance measure.

This paper describes a technique for building a hierarchical structure over words with an efficient agglomerative hierarchical clustering algorithm, in a syntax-constrained task. Assigning words to categories then becomes straightforward, since cutting the structure at any chosen level yields a partition of the vocabulary into categories. We call the algorithm efficient because it uses min-heaps to avoid an exhaustive search for the nearest neighbour of each sample. We describe methods for encoding each word on the basis of the words that usually surround it in the sentences of the task, and we report experiments carried out to tune some essential representation-dependent and algorithm-dependent parameters. The results are good, but only subjectively so: the only way to evaluate them is to inspect the resulting structure and assign it a mark.
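As an illustration of the EM procedure mentioned above, the following is a minimal sketch of EM for a Bernoulli mixture, not the authors' implementation. It assumes that each word is encoded as a binary vector whose bits record which context features co-occur with it; that encoding, the function name `fit_bernoulli_mixture`, and the parameters `iters`, `init`, `seed` and `eps` are hypothetical choices made for this example.

```python
import math
import random


def fit_bernoulli_mixture(X, k, iters=50, init=None, seed=0, eps=1e-6):
    """Fit a k-component Bernoulli mixture to binary vectors X with EM.

    Returns (priors, protos, resp): mixing weights, per-component Bernoulli
    parameters, and the posterior responsibilities of every sample.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    priors = [1.0 / k] * k
    if init is not None:
        # Caller-supplied prototypes, clamped so that logarithms stay finite.
        protos = [[min(max(p, eps), 1.0 - eps) for p in row] for row in init]
    else:
        # Assumption: random prototypes kept away from 0 and 1.
        protos = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    resp = [[1.0 / k] * k for _ in range(n)]

    for _ in range(iters):
        # E-step: posterior probability of each component given each sample,
        # computed in the log domain to avoid underflow.
        for i, x in enumerate(X):
            logs = []
            for c in range(k):
                lp = math.log(priors[c])
                for bit, p in zip(x, protos[c]):
                    lp += math.log(p if bit else 1.0 - p)
                logs.append(lp)
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp[i] = [w / total for w in weights]

        # M-step: re-estimate priors and prototypes from the responsibilities.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            priors[c] = max(mass / n, eps)
            for j in range(d):
                mean = sum(r[c] * x[j] for r, x in zip(resp, X)) / max(mass, eps)
                protos[c][j] = min(max(mean, eps), 1.0 - eps)
        norm = sum(priors)
        priors = [p / norm for p in priors]

    return priors, protos, resp
```

Calling `fit_bernoulli_mixture(X, 2)` on a list of equal-length 0/1 vectors returns soft assignments in `resp`. A deterministic word-to-class mapping still has to be derived in a separate step, for instance by clustering, as the text describes.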
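The heap-based agglomerative clustering described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the algorithm of the paper: it keeps a single global min-heap with lazy deletion (the authors' heap layout may differ), measures squared Euclidean distance between cluster centroids (the text does not name its distance measure), and stops merging once `n_classes` clusters remain, which corresponds to cutting the hierarchy at one level. The names `sq_dist` and `agglomerate` are hypothetical.

```python
import heapq
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(vectors, n_classes):
    """Merge the two closest clusters until n_classes remain.

    Returns the clusters as sorted lists of indices into `vectors`.
    """
    # Live clusters: id -> (centroid, member indices).
    clusters = {i: (list(v), [i]) for i, v in enumerate(vectors)}
    # Min-heap of candidate merges, keyed by centroid distance.
    heap = [(sq_dist(clusters[a][0], clusters[b][0]), a, b)
            for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)
    next_id = len(vectors)

    while len(clusters) > n_classes and heap:
        _, a, b = heapq.heappop(heap)
        # Lazy deletion: skip entries that mention an already merged cluster.
        if a not in clusters or b not in clusters:
            continue
        centroid_a, members_a = clusters.pop(a)
        centroid_b, members_b = clusters.pop(b)
        members = members_a + members_b
        # Size-weighted mean of the two centroids.
        centroid = [(x * len(members_a) + y * len(members_b)) / len(members)
                    for x, y in zip(centroid_a, centroid_b)]
        # Only distances to the new cluster are computed and pushed.
        for other, (centroid_o, _) in clusters.items():
            heapq.heappush(heap, (sq_dist(centroid, centroid_o), other, next_id))
        clusters[next_id] = (centroid, members)
        next_id += 1

    return sorted(sorted(members) for _, members in clusters.values())
```

Each merge costs one heap pop plus one push per surviving cluster, so the closest pair is never found by rescanning all pairs. Calling the function with different values of `n_classes` mimics cutting the same hierarchy at different levels.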