MORE ALIKE THAN NOT: AN ANALYSIS OF WORD FREQUENCIES IN FOUR GENERAL-PURPOSE TEXT CORPORA
We compare word frequency lists derived from four general-purpose written English corpora: BNC, Brown, LOB and WSJ. A statistically significant correlation exists among the ranks of common vocabulary words that appear in more than one list, despite marked differences between the underlying corpora. The correlation may be sufficient to postulate a corpus-independent list of common words. Proper names and specific tokens such as numbers show much less correlation and should be separated from common words so that the similarity among common words is not obscured. Our result has a bearing on word-sense disambiguation, text categorization and text summarization.
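To illustrate what a rank comparison between two frequency lists might look like, here is a minimal sketch in Python. It assumes Spearman's rank correlation as the measure, which the abstract does not specify, and it restricts the comparison to words present in both lists. The function names `rank_words` and `spearman_shared` and the toy counts are hypothetical, chosen only for illustration. They are not drawn from the BNC, Brown, LOB or WSJ data.

```python
def rank_words(counts):
    """Map each word to a 1-based rank by descending frequency.

    Ties are broken alphabetically so that every word gets a distinct
    rank (a simplification; averaged ranks are the usual alternative).
    """
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}


def spearman_shared(counts_a, counts_b):
    """Spearman rank correlation over the words shared by two lists.

    Assumption: words are re-ranked within the shared vocabulary, so
    words unique to one list do not affect the result.
    """
    shared = sorted(set(counts_a) & set(counts_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words")
    ranks_a = rank_words({w: counts_a[w] for w in shared})
    ranks_b = rank_words({w: counts_b[w] for w in shared})
    # Sum of squared rank differences; valid because ranks are distinct.
    d2 = sum((ranks_a[w] - ranks_b[w]) ** 2 for w in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy frequency lists (made-up counts, for illustration only).
list_a = {"the": 10, "of": 8, "and": 5, "zebra": 1}
list_b = {"the": 20, "of": 15, "and": 9, "quark": 2}
rho = spearman_shared(list_a, list_b)
print(f"Spearman rho over shared words: {rho:.3f}")
```

In line with the abstract's recommendation, proper names and numeric tokens would be filtered out of both lists before calling `spearman_shared`. How that filtering is done is left open here.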