Testing the extrapolation quality of word frequency models

Many studies in corpus linguistics and related disciplines aim to determine the characteristic aspects of a language, a particular genre, a group of speakers, an individual speaker or a linguistic process. To do so, they compute certain numerical quantities from an available text sample (i.e., a corpus) and extrapolate them to the full language (or genre, speaker, process, etc.), or at least to much larger samples. For example, a corpus linguist might use the Brown and LOB corpora in this way to draw inferences about the differences between American and British English; a stylometrist might count the distinct words in the Shakespeare canon in order to estimate the richness of Shakespeare's vocabulary; and a morphologist might try to determine whether one word formation process is more productive than another by comparing the number of nonce words formed by each process.