When is best-worst best? A comparison of best-worst scaling, numeric estimation, and rating scales for collection of semantic norms

Large-scale semantic norms have become both prevalent and influential in recent psycholinguistic research. However, little attention has been directed towards understanding methodological best practices for collecting such norms. We compared the quality of semantic norms obtained through rating scales, numeric estimation, and a less commonly used judgment format called best-worst scaling. We found that best-worst scaling usually produces norms with higher predictive validities than the other response formats, and does so while requiring less data to be collected overall. We also found evidence that the various response formats may produce qualitatively, rather than just quantitatively, different data. This raises the issue of potential response format bias, which previous efforts to collect semantic norms have not addressed, likely because they relied on a single response format for each type of semantic judgment. We have made available software for creating best-worst stimuli and scoring best-worst data. We have also made available new norms for age of acquisition, valence, arousal, and concreteness collected using best-worst scaling. These norms include entries for 1,040 words, of which 1,034 are also contained in the ANEW norms (Bradley & Lang, Affective norms for English words (ANEW): Instruction manual and affective ratings (pp. 1-45). Technical report C-1, The Center for Research in Psychophysiology, University of Florida, 1999).
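In best-worst scaling, a rater sees a small set of items (four in the sketch below) and picks the one that is highest and the one that is lowest on the target dimension. The sketch below illustrates the two steps named above, creating stimuli and scoring responses, under stated assumptions: tuples are drawn uniformly at random, and each word is scored with the simple best-minus-worst counting rule, (#best - #worst) / #appearances. It is not the authors' released software, and the function names `make_tuples` and `score_counts`, as well as the latent values in the demo, are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# A trial records the items shown, the item chosen as "best", and the item chosen as "worst".
Trial = Tuple[Tuple[str, ...], str, str]


def make_tuples(words: Sequence[str], n_tuples: int, tuple_size: int = 4,
                seed: int = 0) -> List[Tuple[str, ...]]:
    """Draw n_tuples tuples of distinct words uniformly at random.

    Hypothetical helper: a simple random (unbalanced) design, not a
    balanced incomplete block design.
    """
    rng = random.Random(seed)
    pool = list(words)
    return [tuple(rng.sample(pool, tuple_size)) for _ in range(n_tuples)]


def score_counts(trials: Sequence[Trial]) -> Dict[str, float]:
    """Best-minus-worst counting score: (#best - #worst) / #appearances.

    Scores lie in [-1, 1]; words that never appear are absent from the result.
    """
    appearances: Counter = Counter()
    best: Counter = Counter()
    worst: Counter = Counter()
    for items, chosen_best, chosen_worst in trials:
        appearances.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {w: (best[w] - worst[w]) / n for w, n in appearances.items()}


if __name__ == "__main__":
    # Hypothetical latent values on a 1-9 valence-like scale, used only to
    # simulate a noiseless judge; these are not data from the norms.
    latent = {"murder": 1.0, "war": 1.7, "rain": 4.5,
              "table": 5.0, "music": 7.5, "love": 8.7}
    tuples = make_tuples(list(latent), n_tuples=60)
    # The simulated judge picks the highest item as best and the lowest as worst.
    trials = [(t, max(t, key=latent.__getitem__), min(t, key=latent.__getitem__))
              for t in tuples]
    scores = score_counts(trials)
    for word in sorted(scores, key=scores.__getitem__):
        print(f"{word:>7s} {scores[word]:+.2f}")
```

Because the counting rule ignores which other items a word was compared against, it can be noisy when tuples are drawn at random from many items; the reference list includes work on scoring best-worst data in unbalanced many-item designs (ref [19]), which addresses that setting.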

[1] Michaël A. Stevens et al. Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 2014.

[2] Zachary Estes et al. Automatic vigilance for negative words is categorical and general. 2008.

[3] M. Bradley et al. Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings. 1999.

[4] Chris Westbury et al. Pay no attention to that man behind the curtain: Explaining semantics without semantics. 2016.

[5] Amy Beth Warriner et al. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 2014.

[6] Joseph O'Rourke et al. The Living Word Vocabulary: The Words We Know, A National Vocabulary Inventory. 1976.

[7] Amy Beth Warriner et al. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 2013.

[8] R. Rescorla et al. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. 1972.

[9] Geoff Hollis et al. The principals of meaning: Extracting semantic dimensions from co-occurrence models of semantics. Psychonomic Bulletin & Review, 2016.

[10] Marc Brysbaert et al. Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 2017.

[11] Georgiana Dinu et al. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 2014.

[12] Marc Brysbaert et al. The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 2011.

[13] Saif Mohammad et al. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. ACL, 2017.

[14] Lawrence W. Barsalou et al. Perceptions of perceptual symbols. Behavioral and Brain Sciences, 1999.

[15] Geoff Hollis et al. Extrapolating human judgments from skip-gram vector representations of word meaning. Quarterly Journal of Experimental Psychology, 2017.

[16] G. Vigliocco et al. Emotion words, regardless of polarity, have a processing advantage over neutral words. Cognition, 2009.

[17] Dermot Lynott et al. Strength of perceptual experience predicts word processing performance better than concreteness or imageability. Cognition, 2012.

[18] Marc Brysbaert et al. How useful are corpus-based methods for extrapolating psycholinguistic variables? Quarterly Journal of Experimental Psychology, 2015.

[19] Geoff Hollis et al. Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments. Behavior Research Methods, 2018.

[20] Ping Li et al. Does frequency count? Parental input and the acquisition of vocabulary. Journal of Child Language, 2008.

[21] Saif Mohammad et al. Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best–Worst Scaling. NAACL, 2016.

[22] Lewis Pollock et al. Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 2018.

[23] Stavroula Kousta et al. Toward a theory of semantic representation. Language and Cognition, 2009.

[24] R. Baayen et al. Frequency in lexical processing. 2016.

[25] Marco Marelli et al. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition. Cognitive Science, 2016.

[26] J. Adelman et al. Automatic vigilance for negative words in lexical decision and naming: comment on Larsen, Mercer, and Balota (2006). Emotion, 2008.

[27] M. Brysbaert et al. Norms of valence and arousal for 14,031 Spanish words. Behavior Research Methods, 2016.

[28] Rebecca Treiman et al. The English Lexicon Project. Behavior Research Methods, 2007.

[29] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[30] Dušica Filipović Đurđević et al. An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 2011.

[31] R. W. Stowe et al. Context availability and lexical decisions for abstract and concrete words. 1988.

[32] Benjamin Naumann et al. Mental Representations: A Dual Coding Approach. 2016.

[33] Amy Beth Warriner et al. Emotion and language: valence and arousal affect word recognition. Journal of Experimental Psychology: General, 2014.

[34] Andrew W. Ellis et al. Age of Acquisition Norms for a Large Set of Object Names and Their Relation to Adult Estimates and Other Variables. 1997.

[35] T. Landauer et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. 1997.

[36] Melvin J Yap et al. The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words. Behavior Research Methods, 2016.

[37] M. Brysbaert et al. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 2012.

[38] David P. Vinson et al. How does emotional content affect lexical processing? CogSci, 2013.