Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
Low-frequency words tend to be rich in content, and high-frequency words tend not to be. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like "somewhat" and "boycott". Both words appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but "boycott" is a better keyword because its IDF is farther from what would be expected by chance under a Poisson model.
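The comparison described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation. It assumes IDF is defined as -log2(df/D), where df is the number of documents containing the word and D is the corpus size, and that under a Poisson model with rate cf/D (cf being the total number of occurrences) the chance that a document contains the word at least once is 1 - exp(-cf/D). The function names, the corpus size, and the document frequencies are hypothetical and chosen only for illustration. Only the total count of roughly 1000 comes from the text.

```python
import math


def observed_idf(doc_freq, num_docs):
    """IDF from the observed document frequency: -log2(df / D)."""
    return -math.log2(doc_freq / num_docs)


def poisson_idf(coll_freq, num_docs):
    """IDF predicted if occurrences were scattered at random (Poisson).

    With rate cf / D per document, the probability that a document
    contains the word at least once is 1 - exp(-cf / D).
    """
    rate = coll_freq / num_docs
    return -math.log2(1.0 - math.exp(-rate))


def residual_idf(coll_freq, doc_freq, num_docs):
    """Observed IDF minus the Poisson prediction; larger means burstier."""
    return observed_idf(doc_freq, num_docs) - poisson_idf(coll_freq, num_docs)


# Hypothetical counts for illustration only (not taken from the paper):
# both words occur 1000 times, but one is concentrated in fewer documents.
NUM_DOCS = 100_000
EXAMPLES = {
    "somewhat": (1000, 980),  # (collection frequency, document frequency)
    "boycott": (1000, 450),
}

for word, (cf, df) in EXAMPLES.items():
    print(
        f"{word:10s} observed={observed_idf(df, NUM_DOCS):.3f} "
        f"poisson={poisson_idf(cf, NUM_DOCS):.3f} "
        f"residual={residual_idf(cf, df, NUM_DOCS):.3f}"
    )
```

With these made-up counts, the word whose occurrences cluster in fewer documents gets the larger residual, which is the sense in which its IDF deviates from the Poisson expectation.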