On distributions of sentence lengths in Japanese writing

The lognormal distribution has long been regarded as the most appropriate probability distribution for sentence lengths in Japanese. Yet this view has been supported by only a few studies, which relied on sparse sample data and on reasoning that contradicts linguistic reality. To offer a more realistic approach, we analyzed a substantially larger number of samples. First, 150 essays and short stories were drawn as a random sample; any piece shorter than 100 sentences or longer than 1000 sentences was excluded, leaving 113 sample texts. We also distinguished between kinds of sentences, separating dialogue from narrative. From each of the 113 sample texts, three sentence length frequency distributions were obtained: the first for the complete text, the second for the direct speech collected from that text, and the third for the narrative parts, excluding that direct speech. The results overturn the long-standing belief, showing that the lognormal distribution (which was computed but is not shown here) cannot be fitted well to Japanese sentence length distributions. Our findings indicate that, in place of the lognormal distribution, the Hyperpascal distribution achieves an excellent goodness of fit.
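The sampling procedure described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study, and it rests on several assumptions the abstract does not state: direct speech is taken to be text enclosed in 「」 brackets, sentences are taken to end at 。, ！ or ？, sentence length is counted in characters, and the whole-text distribution is approximated as the sum of the dialogue and narrative distributions. The function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: a sentence ends at 。, ！ or ？. Split after each mark
    # and drop empty fragments.
    return [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]


def length_distributions(text):
    # Assumption: direct speech is whatever sits inside 「...」 brackets.
    speech_chunks = re.findall(r"「([^」]*)」", text)
    narrative_text = re.sub(r"「[^」]*」", "", text)

    speech_sentences = [s for chunk in speech_chunks for s in split_sentences(chunk)]
    narrative_sentences = split_sentences(narrative_text)

    # Assumption: sentence length is measured in characters.
    dialogue = Counter(len(s) for s in speech_sentences)
    narrative = Counter(len(s) for s in narrative_sentences)

    # Simplification: the whole-text distribution is the sum of the two parts.
    return {"whole": dialogue + narrative, "dialogue": dialogue, "narrative": narrative}


def within_size_limits(text, lo=100, hi=1000):
    # Keep only texts with between 100 and 1000 sentences, as in the abstract.
    n_sentences = sum(length_distributions(text)["whole"].values())
    return lo <= n_sentences <= hi
```

Each returned Counter maps a sentence length to the number of sentences of that length, which is the form needed to fit a candidate distribution and test its goodness of fit. Speech quoted in the middle of a narrative sentence leaves that sentence truncated under this sketch, so a real analysis would need a more careful segmentation rule.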