Redefining similarity in a thesaurus by using corpora

The aim of this paper is to automatically define the similarity between two nouns which are generally used in various domains. From these similarities, we can construct a large and general thesaurus. In applications of natural language processing, it is necessary to appropriately measure the similarity between two nouns. The similarity is usually calculated from a thesaurus. Since a handmade thesaurus is not suitable for machine use and is expensive to compile, automatic construction of a thesaurus from corpora has been attempted (Hindle, 1990). However, a thesaurus constructed in this way does not contain many nouns, and those nouns are determined by the corpus used. In other words, we cannot construct a general thesaurus from a corpus alone. This can be regarded as a data sparseness problem: few nouns appear in the corpus.

To overcome data sparseness, methods have been proposed that estimate the distribution of unseen cooccurrences from the distribution of similar words in the seen cooccurrences. Brown et al. proposed a class-based n-gram model, which generalizes the n-gram model, to predict a word from the previous words in a text (Brown et al., 1992). They tackled data sparseness by generalizing a word to the class which contains it. Pereira et al. basically followed the same approach, but proposed a soft clustering scheme in which membership of a word in a class is probabilistic (Pereira et al., 1993). Brown and Pereira provide clustering algorithms that assign words to appropriate classes, based on their respective models. Dagan et al. proposed a similarity-based model in which each word is generalized, not to its own specific class, but to a set of words which are most similar to it (Dagan et al., 1993). Using this model, they successfully predicted which unobserved cooccurrences were more likely than others, and estimated the probability of those cooccurrences (Dagan et al., 1994). However, because these schemes look for similar words in the corpus, the number of similarities that can be defined is rather small compared with the number of similarities over all pairs of words. A scheme that looks for similar words in the corpus is itself already affected by data sparseness.

In this paper, we propose a method distinct from the above methods, one which uses a handmade thesaurus to find similar words. The proposed method avoids data sparseness by estimating undefined similarities from the similarity in the thesaurus and the similarities defined by the corpus. Thus, the obtained similarities are the same in number as the similarities in the thesaurus, and they reflect the particularity of the domain to which the corpus belongs. The use of a thesaurus can obviously provide similar words independently of the corpus, and has the advantage that some ambiguities in analyzing the corpus are resolved. We have experimented using Bunrui-goi-hyou (Bunrui-goi-hyou, 1994), a Japanese handmade thesaurus, and a corpus consisting of five years of Japanese economic newspaper articles, with about 7.85 M sentences. We evaluate the appropriateness of the obtained similarities.
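To make the idea of combining corpus-defined and thesaurus-defined similarities concrete, the following is a minimal sketch, not the paper's actual formulation: it assumes a cosine measure over cooccurrence vectors for the corpus-defined similarity and a shared-class-code measure for the thesaurus similarity, and falls back on the thesaurus when the corpus provides no evidence for a pair. All function names, class codes, and data in the example are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse cooccurrence vectors."""
    dot = sum(v1[k] * v2[k] for k in v1 if k in v2)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def corpus_similarity(cooc, n1, n2):
    """Similarity defined by the corpus; None if either noun is unseen."""
    if n1 not in cooc or n2 not in cooc:
        return None
    return cosine(cooc[n1], cooc[n2])

def thesaurus_similarity(classes, n1, n2, depth=4):
    """Toy thesaurus similarity: fraction of shared class-code prefix
    (the class codes here are invented, not Bunrui-goi-hyou codes)."""
    c1, c2 = classes.get(n1), classes.get(n2)
    if c1 is None or c2 is None:
        return 0.0
    shared = 0
    for a, b in zip(c1[:depth], c2[:depth]):
        if a != b:
            break
        shared += 1
    return shared / depth

def redefined_similarity(cooc, classes, n1, n2):
    """Use the corpus-defined similarity when available; otherwise
    estimate the undefined similarity from the thesaurus."""
    sim = corpus_similarity(cooc, n1, n2)
    return sim if sim is not None else thesaurus_similarity(classes, n1, n2)

# Tiny illustrative data: noun -> context-word counts, and class codes.
cooc = {
    "bank":    defaultdict(int, {"loan": 5, "interest": 3}),
    "company": defaultdict(int, {"loan": 2, "profit": 4}),
}
classes = {"bank": "1265", "company": "1264", "river": "1525"}

print(redefined_similarity(cooc, classes, "bank", "company"))  # corpus-based
print(redefined_similarity(cooc, classes, "bank", "river"))    # thesaurus fallback
```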