Dependent Bigram Identification

Dependent bigrams are two consecutive words that occur together in a text more often than would be expected purely by chance. Identifying such bigrams is important because they provide valuable clues for machine translation, word sense disambiguation, and information retrieval. A variety of significance tests have been proposed (e.g., Church et al., 1991; Dunning, 1993; Pedersen et al., 1996) to identify these interesting lexical pairs. In this poster I present a new statistic, minimum sensitivity, that is simple to compute and is free from the underlying distributional assumptions commonly made by significance tests.

The challenge in identifying dependent bigrams is that most are relatively rare regardless of the amount of text being considered. This follows from the distributional tendencies of individual bigrams as described by Zipf's Law. If the frequencies of the bigrams in a text are ordered from most to least frequent, $(f_1, f_2, \ldots, f_n)$, these frequencies roughly obey $f_i \propto 1/i$.

Consider the following example from a 1,300,000 word sample of the ACL/DCI Wall Street Journal Corpus. A contingency table containing the frequency counts of oil and industry is shown below. These counts show that oil industry occurs 17 times, oil occurs without industry 240 times, industry occurs without oil 1001 times, and bigrams in which neither oil is the first word nor industry the second occur 1,298,742 times. This distribution is sparse and skewed and thus violates a central assumption implicit in significance testing of contingency tables (Read & Cressie, 1988).
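
To make the sparseness concrete, the sketch below rebuilds the 2x2 table from the four counts quoted above and computes the count each cell would be expected to have if oil and industry were independent. It is only an illustration: the helper name `expected_counts` and the variable names are assumptions introduced here, and the calculation is the standard independence model for contingency tables, not the minimum sensitivity statistic itself.

```python
# Observed bigram counts quoted in the text (1,300,000 word WSJ sample).
n11 = 17         # "oil industry"
n12 = 240        # "oil" followed by some word other than "industry"
n21 = 1001       # "industry" preceded by some word other than "oil"
n22 = 1_298_742  # all remaining bigrams


def expected_counts(n11, n12, n21, n22):
    """Expected cell counts under independence (hypothetical helper).

    Each expected count is row_total * column_total / grand_total.
    """
    total = n11 + n12 + n21 + n22
    row_totals = (n11 + n12, n21 + n22)
    col_totals = (n11 + n21, n12 + n22)
    return [
        [row_totals[i] * col_totals[j] / total for j in range(2)]
        for i in range(2)
    ]


total = n11 + n12 + n21 + n22
expected = expected_counts(n11, n12, n21, n22)

print(f"total bigrams: {total}")
print(f"observed 'oil industry': {n11}")
print(f"expected 'oil industry' under independence: {expected[0][0]:.3f}")
```

With these counts the expected frequency of oil industry under independence comes out near 0.2, against an observed 17, while almost all of the mass sits in the single cell for bigrams involving neither word. Expected counts this small and a table this lopsided are the kind of situation in which the large-sample approximations behind common significance tests are usually considered unreliable.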