vocd: A theoretical and empirical evaluation

A reliable index of lexical diversity (LD) has remained stubbornly elusive for over 60 years. Meanwhile, researchers in fields as varied as stylistics, neuropathology, language acquisition, and even forensics continue to use flawed LD indices, often unaware that their results are questionable and in some cases potentially dangerous. Recently, an LD measurement instrument known as vocd has become virtually the standard tool of the LD trade. In this paper, we report both theoretical and empirical evidence that calls into question the rationale for vocd and indicates that its reliability is not optimal. Although our evidence shows that vocd's output (D) is a relatively robust indicator of the aggregate probabilities of word occurrences in a text, we show that these probabilities, and thus also D, are affected by text length. Malvern, Richards, Chipere and Durán (2004) acknowledge that D (as calculated by vocd's default method) can be affected by text length, but claim that the effects are not significant for the ranges of text lengths with which they are concerned. We explain why D is affected by text length, and demonstrate with an extensive empirical analysis that the effects of text length are significant over certain ranges, which we identify.

[1] Kyle B. Dempsey, et al. Identifying Text Genres Using Phrasal Verbs, 2006.

[2] David Malvern, et al. Developmental trends in lexical diversity, 2004.

[3] David Malvern, et al. Lexical Diversity and Language Development: Quantification and Assessment, 2004.

[4] David Malvern, et al. Lexical Diversity and Language Development, 2004.

[5] Arthur C. Graesser, et al. Variation in Language and Cohesion across Written and Spoken Registers, 2004.

[6] R. Hout, et al. Lexical richness in the spontaneous speech of bilinguals, 2003.

[7] David L. Hoover, et al. Another Perspective on Vocabulary Richness, 2003, Comput. Humanit.

[8] Neeraja Sadagopan, et al. Beginning to communicate after cochlear implantation: oral language development in a young child, 2003, Journal of speech, language, and hearing research: JSLHR.

[9] Marilyn Newhoff, et al. Measures of lexical diversity in aphasia, 2003.

[10] Jan Avent, et al. Reciprocal scaffolding: A context for communication treatment in aphasia, 2003.

[11] S. Ransdell, et al. Socioeconomic and sociolinguistic predictors of children’s L2 and L1 writing quality, 2003.

[12] J. A. Smith, et al. Stylistic Constancy and Change Across Literary Corpora: Using Measures of Lexical Richness to Date Works, 2002, Comput. Humanit.

[13] Laurence B. Leonard, et al. Lexical diversity in the spontaneous speech of children with specific language impairment: application of D, 2002, Journal of speech, language, and hearing research: JSLHR.

[14] Kevin Colwell, et al. Interviewing techniques and the assessment of statement credibility, 2002.

[15] Scott Jarvis, et al. Short texts, best-fitting curves and new measures of lexical diversity, 2002.

[16] B. Grela. Lexical verb diversity in children with Down syndrome, 2002, Clinical linguistics & phonetics.

[17] Sameer Singh, et al. A Pilot Study on Gender Differences in Conversational Speech on Lexical Richness Measures, 2001, Lit. Linguistic Comput.

[18] C. E. Shannon, et al. A mathematical theory of communication, 1948, MOCO.

[19] Paul Meara, et al. P-Lex: A Simple and Effective Way of Describing the Lexical Characteristics of Short L2 Tests, 2001.

[20] N. Ratner, et al. Parental perceptions of children's communicative development at stuttering onset, 2000, Journal of speech, language, and hearing research: JSLHR.

[21] David Malvern, et al. Measuring vocabulary diversity using dedicated software, 2000.

[22] Romola S. Bucks, et al. Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance, 2000.

[23] John Read, et al. Assessing Vocabulary, 2000.

[24] R. Harald Baayen, et al. How Variable May a Constant be? Measures of Lexical Richness in Perspective, 1998, Comput. Humanit.

[25] D. Holmes, et al. A stylometric analysis of conversational speech of aphasic patients, 1996.

[26] Patricia L. Carrell, et al. Learning Styles and Composition, 1993.

[27] Trong Wu, et al. An accurate computation of the hypergeometric distribution function, 1993, TOMS.

[28] N. Ratner, et al. Patterns of parental vocabulary selection in speech to very young children, 1988, Journal of Child Language.

[29] Douglas Biber, et al. Variation across speech and writing: Methodology, 1988.

[30] C. W. Hess, et al. Sample size and type-token ratios for oral language of preschool children, 1986, Journal of speech and hearing research.

[31] Pierre J. L. Arnaud. The lexical richness of L2 written productions and the validity of vocabulary tests, 1984.

[32] R. Quirk, et al. A Corpus of English Conversation, 1980.

[33] H. S. Heaps, et al. Information retrieval, computational and theoretical aspects, 1978.

[34] J. Genderen, et al. Testing Land-Use Map Accuracy, 1977.

[35] H. Sichel. On a Distribution Law for Word Frequencies, 1975.

[36] H. Kucera, et al. Computational analysis of present-day American English, 1967.

[37] John B. Carroll, et al. Language and Thought, 1965.

[38] P. Guiraud. Problèmes et méthodes de la statistique linguistique, 1960.

[39] M. C. Templin. Certain language skills in children: their development and interrelationships, 1957.

[40] M. C. Templin. Certain language skills in children, 1957.

[41] John W. Chotlos, et al. IV. A statistical and comparative analysis of individual written language samples, 1944.