Zipf's Law and Heaps' Law Can Predict the Size of Potential Words
We confirm Zipf’s law and Heaps’ law using various types of documents, including literary works, blogs, and computer programs. Independent of the document type, the exponents of Zipf’s law are estimated to be approximately 1, whereas the exponents of Heaps’ law appear to depend on the observation size, with estimated values scattered around 0.5. By definition, randomly shuffled documents reproduce Zipf’s law and Heaps’ law. However, artificial documents generated from the empirically observed Zipf’s law and the empirically observed number of distinct words do not reproduce Heaps’ law. We demonstrate that Heaps’ law does hold for artificial documents in which a certain number of additional distinct words are added to the empirically observed ones. This suggests that the number of potential distinct words considered in the creation of a given document can be predicted.
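The kind of experiment described above can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not specify. It assumes a pure power-law rank distribution with exponent `alpha`, draws tokens independently, and estimates a Heaps exponent as the least-squares slope of log V(n) against log n. The function names, parameter values, and the choice of log-spaced sample points are all hypothetical.

```python
import math
import random
from itertools import accumulate


def generate_document(n_tokens, n_types, alpha=1.0, seed=0):
    """Draw n_tokens word ranks i.i.d. from a Zipf-like distribution.

    Assumption: the word of rank r has probability proportional to
    r**(-alpha), truncated at n_types distinct words.
    """
    rng = random.Random(seed)
    cum_weights = list(accumulate(r ** -alpha for r in range(1, n_types + 1)))
    return rng.choices(range(1, n_types + 1), cum_weights=cum_weights, k=n_tokens)


def vocabulary_growth(tokens):
    """Return V(n), the number of distinct words among the first n tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth, n_points=20):
    """Estimate beta in V(n) ~ n**beta by a log-log least-squares fit.

    The fit uses roughly n_points log-spaced observation sizes, so the
    result depends on that (arbitrary) choice of sample points.
    """
    n_max = len(growth)
    sizes = sorted({max(1, round(n_max ** (i / n_points)))
                    for i in range(1, n_points + 1)})
    if len(sizes) < 2:
        raise ValueError("need at least two distinct observation sizes")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(growth[n - 1]) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    n_tokens = 20000
    observed_types = 2000   # hypothetical number of observed distinct words
    extra_types = 2000      # hypothetical number of added "potential" words

    for n_types in (observed_types, observed_types + extra_types):
        growth = vocabulary_growth(generate_document(n_tokens, n_types))
        print(n_types, growth[-1], round(heaps_exponent(growth), 3))
```

Running the script prints, for each vocabulary size, the number of distinct words actually drawn and the fitted exponent. Comparing the two rows gives a rough feel for how enlarging the pool of available words changes the vocabulary growth curve. Reproducing the paper's result would require the empirical rank-frequency data rather than an idealized power law.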