Zipf's Law and Heaps' Law Can Predict the Size of Potential Words

We confirm Zipf’s law and Heaps’ law for various types of documents, including literary works, blogs, and computer programs. Independent of the document type, the exponents of Zipf’s law are estimated to be approximately 1, whereas the exponents of Heaps’ law appear to depend on the observation size, with estimated values scattered around 0.5. By definition, randomly shuffled documents reproduce both Zipf’s law and Heaps’ law. However, artificial documents generated from the empirically observed Zipf’s law and the observed number of distinct words do not reproduce Heaps’ law. We demonstrate that Heaps’ law does hold for artificial documents in which a certain number of additional distinct words are added to the empirically observed ones. This suggests that the number of potential distinct words considered in the creation of a given document can be predicted.
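The comparison described above can be illustrated with a minimal sketch. It assumes that an artificial document is produced by drawing tokens independently from a Zipf distribution over a fixed pool of word types, which may differ from the generation procedure actually used in this work. All function names, the log-spaced fitting points, and the pool sizes in the usage example are hypothetical choices made for illustration.

```python
import math
import random
from itertools import accumulate


def generate_document(num_tokens, num_types, zipf_exponent=1.0, seed=0):
    """Draw tokens independently from a Zipf distribution over num_types ranks.

    Assumption: independent sampling; this is an illustrative stand-in,
    not necessarily the procedure used in the paper.
    """
    rng = random.Random(seed)
    # Unnormalized Zipf weights 1 / rank**exponent, accumulated for sampling.
    cum_weights = list(
        accumulate(1.0 / (rank ** zipf_exponent) for rank in range(1, num_types + 1))
    )
    return rng.choices(range(num_types), cum_weights=cum_weights, k=num_tokens)


def vocabulary_growth(tokens):
    """Return V(n), the number of distinct words among the first n tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth, num_points=20):
    """Estimate the Heaps exponent as the least-squares slope of
    log V(n) against log n, using roughly log-spaced values of n."""
    total = len(growth)
    sizes = sorted(
        {min(total, max(1, round(total ** (i / num_points))))
         for i in range(1, num_points + 1)}
    )
    xs = [math.log(n) for n in sizes]
    ys = [math.log(growth[n - 1]) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical sizes: 2000 "observed" types versus a pool with 8000 extra.
    for pool_size in (2000, 10000):
        growth = vocabulary_growth(generate_document(20000, pool_size))
        print(pool_size, growth[-1], heaps_exponent(growth))
```

With a pool limited to the observed number of distinct words, vocabulary growth saturates as the pool is exhausted, so the fitted slope is lower than with an enlarged pool. Under these assumptions, varying the number of extra types until the fitted growth curve matches an empirical one would give a rough estimate of the pool of potential words.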