We have known for some time that content words have "bursty" distributions in text (e.g. Church 00). In contrast, much of the literature assumes that function words are uninformative because they are distributed homogeneously (e.g. Katz 96). In this paper, we show through two sets of experiments that assumptions of homogeneity do not hold, even for the distribution of extremely frequent function words. In the first set of experiments, we investigate the behaviour of very frequent function words in the TIPSTER collection by postulating a "homogeneity assumption", which we then defeat in a series of experiments based on the χ² test. The results show that it is statistically unreasonable to assume homogeneous term distributions within a corpus. We also find that document collections are not neutral with respect to the property of homogeneity, even for very frequent function words. In the second set of experiments, we model the gaps between successive occurrences of a particular term using a mixture of exponential distributions. Under the "homogeneity assumption", these gaps should be uniformly distributed across the entire corpus. Using the model, however, we demonstrate that gaps are not uniformly distributed, and that even very frequent terms occur in bursts. Since the homogeneity assumption was defeated resoundingly for diverse collections, we propose that these homogeneity measures and the re-occurrence model are suitable candidates for corpus profiling.
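As an illustration of the two kinds of measurement described above, the following is a minimal sketch and not the procedure used in the paper. It assumes a corpus already tokenised into a list of strings, splits it into contiguous chunks, computes a Pearson χ² statistic for a term's occurrence versus non-occurrence across those chunks, and extracts the gaps between successive occurrences of the term. The function names (`chunk_counts`, `chi_square_homogeneity`, `gaps`) and the chunking scheme are assumptions introduced here for illustration only.

```python
def chunk_counts(tokens, term, n_chunks):
    """Split tokens into n_chunks contiguous chunks (hypothetical scheme).

    Returns (counts, sizes): occurrences of `term` in each chunk and the
    chunk lengths. The last chunk absorbs any remainder.
    """
    size = len(tokens) // n_chunks
    counts, sizes = [], []
    for i in range(n_chunks):
        start = i * size
        end = (i + 1) * size if i < n_chunks - 1 else len(tokens)
        chunk = tokens[start:end]
        counts.append(chunk.count(term))
        sizes.append(len(chunk))
    return counts, sizes


def chi_square_homogeneity(counts, sizes):
    """Pearson chi-square statistic for a 2 x k table (term vs. other tokens).

    Under homogeneity the term has the same relative frequency p in every
    chunk. Returns (statistic, degrees_of_freedom).
    """
    total = sum(counts)
    n = sum(sizes)
    p = total / n
    if p <= 0.0 or p >= 1.0:
        raise ValueError("term must occur in some but not all token positions")
    stat = 0.0
    for observed, size in zip(counts, sizes):
        expected_term = size * p
        expected_other = size * (1.0 - p)
        stat += (observed - expected_term) ** 2 / expected_term
        stat += ((size - observed) - expected_other) ** 2 / expected_other
    return stat, len(counts) - 1


def gaps(tokens, term):
    """Distances, in tokens, between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]
```

A large statistic relative to the χ² distribution with the returned degrees of freedom would count as evidence against homogeneity for that term. The list returned by `gaps` is the kind of data to which a mixture of exponential distributions could then be fitted; the fitting step itself is not shown here.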
[1] C. Robert. Mixtures of Distributions: Inference and Estimation, 1996.
[2] Slava M. Katz. Distribution of content words and phrases in text and language modelling, 1996, Natural Language Engineering.
[3] Alexander Franz. Independence Assumptions Considered Harmful, 1997, ACL.
[4] Ted Dunning, et al. Accurate Methods for the Statistics of Surprise and Coincidence, 1993, CL.
[5] Adam Kilgarriff, et al. Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora, 1997, VLC.
[6] Paul Rayson, et al. Comparing Corpora using Frequency Profiling, 2000, Proceedings of the workshop on Comparing corpora.
[7] David B. Dunson, et al. Bayesian Data Analysis, 2010.