Towards Unsupervised Extraction of Verb Paradigms from Large Corpora

A verb paradigm is a set of inflectional categories for a single verb lemma. To obtain verb paradigms we extracted left and right bigrams for the 400 most frequent verbs from over 100 million words of text, calculated the Kullback-Leibler distance for each pair of verbs for left and right contexts separately, and ran a hierarchical clustering algorithm for each context. Our new method for finding unsupervised cut points in the cluster trees produced results that compared favorably with results obtained using supervised methods, such as gain ratio, a revised gain ratio and number of correctly classified items. Left context clusters correspond to inflectional categories, and right context clusters correspond to verb lemmas. For our test data, 91.5% of the verbs are correctly classified for inflectional category, 74.7% are correctly classified for lemma, and the correct joint classification for lemma and inflectional category was obtained for 67.5% of the verbs. These results are derived only from distributional information without use of morphological information.

* This work was supported by grants from Palladium Systems and the Glidden Company to the first author. The comments and suggestions of Martha Palmer, Hoa Trang Dang, Adwait Ratnaparkhi, Bill Woods, Lyle Ungar, and anonymous reviewers are also gratefully acknowledged.

1 Introduction

This paper presents a new, largely unsupervised method which, given a list of verbs from a corpus, will simultaneously classify the verbs by lemma and inflectional category. Our long-term research goal is to take a corpus in an unanalyzed language and to extract a grammar for the language in a matter of hours using statistical methods with minimum input from a native speaker. Unsupervised methods avoid the labor-intensive annotation required to produce the training materials for supervised methods.
The cost of annotated data becomes particularly onerous for large projects across many languages, such as machine translation. If our method ports well to other languages, it could be used to automatically create a morphological analysis tool for verbs in languages where verb inflections have not already been thoroughly studied.

Precursors to this work include (Pereira et al., 1993), (Brown et al., 1992), (Brill & Kapur, 1993), (Jelinek, 1990), and (Brill et al., 1990) and, as applied to child language acquisition, (Finch & Chater, 1992). Clustering algorithms have previously been shown to work fairly well for the classification of words into syntactic and semantic classes (Brown et al., 1992), but determining the optimum number of classes for a hierarchical cluster tree remains a difficult problem, particularly without prior knowledge of the item classification. For semantic classifications, the correct assignment of items to classes is usually not known in advance. In these cases only an unsupervised method, which has no prior knowledge of the item classification, can be applied.

Our approach is to evaluate our new, largely unsupervised method in a domain for which the correct classification of the items is well known, namely the inflectional category and lemma of a verb. This allows us to compare the classification produced by the unsupervised method to the classifications produced by supervised methods. The supervised methods we examine are based on information content and number of items correctly classified. Our unsupervised method uses a single parameter, the expected size of the cluster. The classifications by inflectional category and lemma are additionally interesting because they produce trees with very different shapes. The classification tree for inflectional category has a few large clusters, while the tree for verb lemmas has many small clusters.
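As a rough illustration of what scoring a cut against a known classification can look like, the sketch below computes two supervised measures for a candidate partition of verbs: a gain ratio and a count of correctly classified items. It assumes the standard decision-tree definition of gain ratio (information gain divided by split information) and a majority-label rule for counting correct items. The definitions actually used in this work, including the revised gain ratio, are not given in this section, and the function names, toy verbs, and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(clusters, gold):
    """Information gain of the partition divided by its split information.

    clusters: list of lists of items; gold: dict mapping item -> true class.
    Assumes the usual decision-tree definition, not necessarily the paper's.
    """
    n = sum(len(c) for c in clusters)
    all_labels = [gold[x] for c in clusters for x in c]
    remainder = sum(len(c) / n * entropy([gold[x] for x in c]) for c in clusters)
    gain = entropy(all_labels) - remainder
    # Split information is the entropy of the cluster membership itself.
    split_info = entropy([i for i, c in enumerate(clusters) for _ in c])
    return gain / split_info if split_info > 0 else 0.0


def majority_correct(clusters, gold):
    """Items that carry the most common true label of their own cluster."""
    return sum(Counter(gold[x] for x in c).most_common(1)[0][1] for c in clusters)


# Hypothetical toy data: six verb forms labeled by inflectional category.
gold = {"sees": "3sg", "takes": "3sg", "see": "base",
        "take": "base", "saw": "past", "took": "past"}

coarse = [["sees", "takes", "see", "take"], ["saw", "took"]]  # cut too high
good = [["sees", "takes"], ["see", "take"], ["saw", "took"]]  # matches gold
fine = [[v] for v in gold]                                    # cut too low

for name, cut in [("coarse", coarse), ("good", good), ("fine", fine)]:
    print(name, round(gain_ratio(cut, gold), 3), majority_correct(cut, gold))
```

In this toy case the over-split cut has a lower gain ratio than the cut that matches the known classes, while the under-split cut has fewer items matching their cluster's majority label. Both measures need the true labels, which is what makes them supervised.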
Our unsupervised method not only performs as well as the supervised methods, but is also more robust for different shapes of the classification tree.

Our results are based solely on distributional criteria and are independent of morphology. We completely ignore relations between words that are derived from spelling. We assume that any difference in form indicates a different item and have not "cleaned up" the data by removing capitalization, etc. Morphology is important for the classification of verbs, and it may well solve the problem for regular verbs. However, morphological analysis will certainly not handle highly irregular, high-frequency verbs. What is surprising is that strictly local context can make a significant contribution to the classification of both regular and irregular verbs. Distributional information is most easily extracted for high-frequency verbs, which are the verbs that tend to have irregular morphology.

This work is important because it develops a methodology for analyzing distributional information in a domain that is well known. This methodology can then be applied with some confidence to other domains for which the correct classification of the items is not known in advance, for example to the problem of semantic classification.
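The distributional idea can be sketched on a toy scale: collect the word immediately to the left (or right) of each verb, turn the counts into probability distributions, compare verbs pairwise with a Kullback-Leibler measure, and cluster bottom-up. This is a minimal sketch, not the procedure used here. The add-alpha smoothing, the symmetrized divergence, the average-link merging, the fixed number of clusters, and all function names and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def context_counts(tokens, verbs, side):
    """Count the word immediately to the left or right of each verb token."""
    offset = -1 if side == "left" else 1
    counts = {v: Counter() for v in verbs}
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok in counts and 0 <= j < len(tokens):
            counts[tok][tokens[j]] += 1
    return counts


def smoothed(counter, vocab, alpha=0.5):
    """Add-alpha smoothing (an assumption) keeps every probability nonzero."""
    total = sum(counter.values()) + alpha * len(vocab)
    return {w: (counter[w] + alpha) / total for w in vocab}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def average_link(a, b, dist):
    """Mean pairwise distance between two clusters (assumed linkage)."""
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def agglomerate(items, dist, n_clusters):
    """Greedy bottom-up merging until n_clusters clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]], dist),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def cluster_by_context(tokens, verbs, side, n_clusters):
    """Cluster verbs by their left or right neighbor distributions."""
    vocab = sorted(set(tokens))
    counts = context_counts(tokens, verbs, side)
    probs = {v: smoothed(counts[v], vocab) for v in verbs}
    # Symmetrized by summing both directions (an assumption).
    dist = {
        frozenset((a, b)): kl(probs[a], probs[b]) + kl(probs[b], probs[a])
        for a, b in combinations(verbs, 2)
    }
    return agglomerate(verbs, dist, n_clusters)


# Hypothetical toy corpus; the real data is over 100 million words.
tokens = ("he sees stars . she sees birds . they see stars . we see birds . "
          "he takes notes . she takes turns . they take notes . we take turns .").split()
verbs = ["sees", "see", "takes", "take"]

left_clusters = cluster_by_context(tokens, verbs, "left", 2)
right_clusters = cluster_by_context(tokens, verbs, "right", 2)
print("left :", left_clusters)
print("right:", right_clusters)
```

On this hand-built corpus the left-context run groups "sees" with "takes" and "see" with "take" (inflectional category), while the right-context run groups "sees" with "see" and "takes" with "take" (lemma). The corpus was constructed to show that pattern, so it illustrates the mechanism rather than providing evidence for it.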