Adding part-of-speech information to the SUBTLEX-US word frequencies

The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts of speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry’s total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. We have not provided a column with lemma frequency, because the current definition of this variable does not seem to give word recognition researchers useful information, as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.
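As an illustration of how the five columns relate to one another, the following minimal sketch derives them from the per-PoS counts of a single entry. The function name, the field names, and the example counts are hypothetical and chosen only for illustration; they are not the headers or values of the released SUBTLEX-US file, and the alphabetical tie-breaking rule is an assumption.

```python
def summarize_pos(pos_counts):
    """Derive the five PoS fields for one entry from its per-PoS counts.

    pos_counts maps a PoS label to the number of times the entry
    received that tag. All names here are illustrative assumptions.
    """
    total = sum(pos_counts.values())
    if total <= 0:
        raise ValueError("entry has no tagged occurrences")
    # Rank PoSs by descending frequency; break ties alphabetically
    # so that the result is deterministic (an assumption of this sketch).
    ranked = sorted(pos_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    dominant_pos, dominant_freq = ranked[0]
    return {
        "dominant_pos": dominant_pos,                 # most frequent PoS
        "dominant_pos_freq": dominant_freq,           # its frequency
        "dominant_pos_share": dominant_freq / total,  # relative to entry total
        "all_pos": [pos for pos, _ in ranked],        # every observed PoS
        "all_pos_freqs": [n for _, n in ranked],      # their frequencies
    }


# Invented counts for a hypothetical entry tagged as both verb and noun.
example = summarize_pos({"Noun": 30, "Verb": 70})
```

With these invented counts the dominant PoS is the verb reading, and its share is the verb count divided by the entry's total count. Keeping the full ranked lists alongside the dominant PoS is what allows different combinations of frequencies to be tested later.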
