Measuring Lexical Style and Competence: The Type-Token Vocabulary Curve

A personal computer is used to analyze samples from literary texts by thirteen different authors, including passages from Genesis, Hemingway, Longfellow, Jane Austen, Henry James, George Eliot, James Joyce, and Basic English (created by C. K. Ogden). The total number of words (tokens) and the number of distinct vocabulary words (types) are computed for each sample. The number of types is then plotted against the number of tokens for eight of the texts. From these type-token curves, inferences are drawn about both lexical style (vocabulary use) and lexical competence (vocabulary size). For example, the curves for "Big Two-Hearted River" and for a summary of Macbeth in Basic English nearly coincide for their first 1100 tokens, after which they gradually diverge. This graphical pattern corresponds to the intuition that Hemingway's prose reads much like Basic English but draws upon a larger total vocabulary. The curve for Joyce's Ulysses, by contrast, rises much more rapidly than that for a late passage from A Portrait of the Artist as a Young Man; however, after 800 tokens, the two curves begin to converge. This suggests that the difference between Ulysses and Portrait is largely one of lexical style rather than competence. The highest type-token curve for the samples tested was that for Finnegans Wake; the lowest curve was for Genesis. Comparison with type-token statistics gathered by Kucera and Francis suggests that the curves for the Wake and Genesis are near the maxima and minima for English literature.

One of the earliest claims made by generative grammarians, now widely accepted, is that we acquire a nearly complete grammar of our native language by about the age of six or eight. But insofar as a generative grammar includes a lexicon, acquisition is never complete: throughout our lives, we continue to add new words to our vocabularies and to alter our use of old ones. Our lexicon--our vocabulary--is one facet of linguistic knowledge (competence), just as our choice of words in specific cases is one facet of linguistic style (performance). Most generative grammarians have assumed that statistics are useful in describing style but not competence. As a rule of thumb, if a linguistics article includes diagrams (flow charts and tree diagrams), it is usually written by a generative grammarian, whereas if it includes graphs, it is usually written by a sociolinguist. But computational linguists such as Herdan (1960) and Carroll (1968, 1971) have argued persuasively that statistics can be used to estimate not only the frequency of use of specific words but also the size of the vocabulary from which they are drawn; hence, statistical evidence is relevant to both style and competence.

In this article, I attempt to demonstrate the usefulness (and the limitations) of one statistical measure--a type-token vocabulary curve--in describing both the style and the competence of a variety of English and American authors. For this study, I wrote a TurboPascal program that reads text files from a Macintosh SE computer, counts each word (token) in a text, records each new vocabulary word (type) as it is encountered, and computes the total number of tokens and types accumulated at that point in the text. I then applied the program to twenty passages written by thirteen different authors and compared their type-token statistics. The results are summarized in the table in (10) below.
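The TurboPascal source itself is not reproduced here, but the core of such a program is simple. The following Python sketch is a minimal reconstruction of the idea, not the original code; the regular expression anticipates the "graphic word" definition discussed in the next section, and names such as type_token_curve are illustrative only.

    import re

    # Rough stand-in for the Kucera-Francis "graphic word": a run of
    # alphanumeric characters, optionally joined by internal hyphens or
    # apostrophes (can't, babe-in-arms), with case ignored throughout.
    GRAPHIC_WORD = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

    def type_token_curve(text):
        """Return one (tokens so far, types so far) pair per word token."""
        seen = set()      # distinct vocabulary words (types) encountered
        curve = []
        for count, token in enumerate(GRAPHIC_WORD.findall(text.lower()), 1):
            seen.add(token)               # repeated tokens add no new type
            curve.append((count, len(seen)))
        return curve

Plotting the second member of each pair against the first for a given sample yields the type-token curves discussed throughout this article.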
Certain decisions--and compromises--must be made in any computerized study of vocabulary: for example, does time-worn count as one word (token) or two? Are cat and cats different words or different forms of the same word (type)? For the purposes of computerized analysis of texts, the simplest (though not the most accurate) definition for printed word tokens is the one used in Kucera and Francis 1967 and in Francis and Kucera 1982 (p. 3): "Graphic word: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks." By this definition, can't, cannot, and babe-in-arms count as single words, whereas can not and can opener are two-word phrases. Such a definition can introduce statistical errors because the choice between, say, tear jerker, tear-jerker, and tearjerker reflects editorial conventions rather than genuine differences in vocabulary. Fortunately, the passages examined for this article contain few variants like these--with the notable exception of Finnegans Wake: from a statistical point of view, the size of the vocabulary in the Wake is inflated by Joyce's playful invention of compounds such as upturnpikepointandplace, which would have been written as phrases by nearly any other English author.

Again in order to simplify computer analysis, "A 'distinct word' (type) can also be simply defined as a set of identical individual words" (Kucera and Francis, p. xxi). That is, Kucera and Francis count all and only identical alphanumeric strings as the same graphic word (type). Differences between upper and lower case are ignored, resulting in occasional errors; for example, Brown (proper noun) and brown (adjective) count as the same word, as do Polish and polish. Similarly, bear (noun, 'mammal') and bear (verb, 'carry') count as the same word, while can't, cannot, and can not count as different words, as do inflectional forms such as bear, bears, and bear's. The authors admit that this sort of computerized count gives somewhat inaccurate results, since "distinct graphic word" is not identical with what most people mean by "distinct word." In practice, however, the two concepts are closely correlated, and in comparative studies such as this one, the precise definition used for "distinct word" is less important than ensuring that the definition is explicit and that it is applied consistently (a short sketch of this definition in code appears below).

Francis and Kucera 1982 take their analysis a step further. Graphic words are grouped into lemmas, defined as follows: "Lemma: a set of grammatical words having the same stem and/or meaning and belonging to the same major word class, differing only in inflection and/or spelling" (p. 3). Hence, the lemma be subsumes the inflectional forms been and being, the suppletive forms am, is, was, and were, and even spelling and dialect variants such as are/ah and were/wuh. Most people, when speaking of an author's "total vocabulary," probably mean something like 'total number of lemmas' rather than 'total number of graphic/phonetic words'. For example, cat, cats, and cat's are not usually thought of as "different words," but as singular, plural, and possessive forms of the "same word." Thus, Thorndike and Lorge (1944, p. ix) count inflected forms of nouns, verbs, and adjectives "under the main word," and their list of "30,000 words" actually subsumes thousands of additional graphic words. Thorndike and Lorge do, however, list suppletive forms separately, with separate entries for words such as am/be/is/was/were.
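To make the graphic-word definition concrete, here is a minimal Python sketch--again my own illustration, not code from Kucera and Francis or from the program described above; the assertions simply restate the examples just given.

    import re

    # "Graphic word": contiguous alphanumeric characters, which may include
    # hyphens and apostrophes but no other punctuation marks.
    GRAPHIC_WORD = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

    def graphic_words(text):
        # Case-folding merges Brown/brown and Polish/polish into one type --
        # the occasional, tolerated error noted above.
        return GRAPHIC_WORD.findall(text.lower())

    assert graphic_words("can't cannot babe-in-arms") == ["can't", "cannot", "babe-in-arms"]
    assert graphic_words("can not") == ["can", "not"]      # a two-word phrase
    assert len(set(graphic_words("Brown, brown"))) == 1    # one graphic type

Note that bear's remains distinct from bear and bears under this definition; collapsing them would require the lemmatization that Francis and Kucera 1982 carry out by hand.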
In this they differ from Ogden (1934, p. 3), who describes Basic English as "a careful and systematic selection of 850 English words which will cover those needs of everyday life for which a vocabulary of 20,000 words is frequently employed." Ogden's claim borders on false advertising, since his copyrighted list of 850 words is intended to subsume not only hundreds of additional inflected and derived words but also suppletive forms--the most spectacular example being he, which is listed as the lemma for twelve distinct graphic words: he, him, his, she, her, hers, it, its, they, them, their, and theirs. (This is the generic he run rampant!) Consequently, the 850-word vocabulary of Basic English actually expands to thousands of distinct graphic words, although these remain a small subset of the total vocabulary of a typical native speaker of English.

These examples from Ogden, Thorndike and Lorge, and Kucera and Francis illustrate that there is considerable variation in what counts as the "same word" (type) in statistical studies of vocabulary; hence, it is difficult to compare the results from different studies. For this article, I have used the definition of distinct word (type) that simplifies computer analysis--the one used in Kucera and Francis 1967: a word (type) is any distinct, contiguous string of alphanumeric characters (including hyphens and apostrophes but excluding other punctuation) that is preceded and followed by a space. Where possible, I have also selected samples of 2000 words (tokens) or more--the size of the samples in Kucera and Francis--so that my results for individual authors can be compared directly with theirs.

Statistical studies of vocabulary almost ritualistically report the ratio between types and tokens for a given sample of text. However, this statistic turns out to be nearly useless, as an analysis of one set of type-token ratios will illustrate. In (1), using the MYSTAT software package, I have plotted the type-token ratios against the number of tokens for "Macbeth," a translation into Basic English by T. Takata of a passage from Charles Lamb's Stories from Shakespeare (Ogden, pp. 286-298).

(1) Type-Token Ratios for "Macbeth" in Basic English

The first sentence of the passage in Basic English can be used to illustrate how this curve is derived:

(2) At the time when Duncan the Kind was King of Scotland, there was a great lord, named Macbeth.

The first type-token ratio (for the word at) is 1/1 = 1.0; the second type-token ratio (for the first two words at the) is 2/2 = 1.0, and so on until the first repetition--the sixth word the--where the type-token ratio falls below 1.0 (5 types / 6 tokens = 0.833) and remains below 1.0 thereafter. The next repetition is the thirteenth word was, where the type-token ratio is 11/13 = 0.846. As the text unfolds, more repetitions occur, and the type-token ratio continues to fall--rapidly at first, and then more slowly.
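This running computation is easy to reproduce. The short Python fragment below, my own illustration using the same graphic-word tokenization sketched earlier, prints the ratio after each token of sentence (2); it yields 5/6 = 0.833 at the sixth token and 11/13 = 0.846 at the thirteenth, matching the figures above.

    import re

    sentence = ("At the time when Duncan the Kind was King of Scotland, "
                "there was a great lord, named Macbeth.")

    seen = set()
    for count, token in enumerate(
            re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", sentence.lower()), 1):
        seen.add(token)
        # e.g. token 6 ("the") prints 5/6 = 0.833; token 13 ("was"), 11/13 = 0.846
        print(f"{count:2d} {token:10s} {len(seen)}/{count} = {len(seen)/count:.3f}")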