MORE ALIKE THAN NOT: AN ANALYSIS OF WORD FREQUENCIES IN FOUR GENERAL-PURPOSE TEXT CORPORA

We compare word frequency lists derived from four general-purpose written English corpora: BNC, Brown, LOB and WSJ. A statistically significant correlation exists among the ranks of common vocabulary words that appear in more than one list, despite marked differences between the underlying corpora. The correlation may be sufficient to postulate a corpus-independent list of common words. Proper names and specific tokens such as numbers show much less correlation and should be separated from common words so that the similarity among the common words is not obscured. Our result has a bearing on word-sense disambiguation, text categorization and text summarization.
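
As an illustration of the kind of comparison described above, the following minimal sketch ranks words by frequency in two token lists and computes a rank correlation over the words the two lists share. It assumes Spearman's rank correlation, which the text above does not name as the statistic used; the function names `rank_words` and `spearman_on_shared` and the toy token lists are hypothetical and are not drawn from BNC, Brown, LOB or WSJ.

```python
from collections import Counter


def rank_words(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Ties in frequency are broken alphabetically so that every word
    receives a distinct rank (an assumption made for simplicity).
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def spearman_on_shared(ranks_a, ranks_b):
    """Spearman's rho over the words present in both rank lists.

    Words are re-ranked within the shared vocabulary so that ranks run
    from 1 to n in each list before the differences are taken.
    """
    shared = set(ranks_a) & set(ranks_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words")

    order_a = sorted(shared, key=lambda w: ranks_a[w])
    order_b = sorted(shared, key=lambda w: ranks_b[w])
    pos_a = {w: i + 1 for i, w in enumerate(order_a)}
    pos_b = {w: i + 1 for i, w in enumerate(order_b)}

    # Classic formula for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d_squared = sum((pos_a[w] - pos_b[w]) ** 2 for w in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Toy token lists standing in for two corpora (hypothetical data).
    corpus_a = "the cat sat on the mat and the dog sat on the rug".split()
    corpus_b = "the dog and the cat ran to the park on the hill".split()
    rho = spearman_on_shared(rank_words(corpus_a), rank_words(corpus_b))
    print(f"Spearman rho over shared words: {rho:.3f}")
```

Restricting the computation to shared words mirrors the phrase "words appearing in more than one list"; filtering out proper names and numeric tokens before ranking would be a separate preprocessing step that this sketch does not attempt.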