Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments

Best-worst scaling is a judgment format in which participants are presented with a set of items and must choose the superior and inferior items in the set. Best-worst scaling generates a large quantity of information per judgment because each judgment allows for inferences about the rank value of all unjudged items. This property makes it a promising judgment format for research in psychology and natural language processing concerned with estimating the semantic properties of tens of thousands of words. A variety of scoring algorithms have been devised in the previous literature on best-worst scaling. However, these algorithms are too computationally expensive to be applied to cases in which thousands of items need to be scored. New algorithms are presented here for converting best-worst responses into item scores for thousands of items (many-item scoring problems). These scoring algorithms are validated through simulation and empirical experiments. Three factors are identified that can affect the relative quality of the derived item scores: noise, the underlying distribution of true values, and trial design. The newly introduced scoring algorithms consistently outperformed the scoring algorithms used in the previous literature on many-item best-worst data.
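
To make the judgment format concrete, the following minimal Python sketch shows how a single best-worst trial implies several pairwise orderings, and how responses can be turned into item scores with a simple best-minus-worst count. This count-based scoring is a common baseline and is used here only as an illustration; it is not one of the new algorithms presented in this paper. The function names, the trial representation as `(items, best, worst)` tuples, and the example words are all assumptions made for this sketch.

```python
from collections import defaultdict


def implied_pairs(items, best, worst):
    """Return the (higher, lower) pairs implied by one best-worst trial.

    The best item outranks every other item in the set, and every
    other item outranks the worst item.
    """
    pairs = set()
    for item in items:
        if item != best:
            pairs.add((best, item))
        if item != best and item != worst:
            pairs.add((item, worst))
    return pairs


def best_minus_worst_scores(trials):
    """Score each item as (times best - times worst) / times shown.

    `trials` is an iterable of (items, best, worst) tuples; this
    representation is hypothetical, chosen for the illustration.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [best, worst, appearances]
    for items, best, worst in trials:
        for item in items:
            counts[item][2] += 1
        counts[best][0] += 1
        counts[worst][1] += 1
    return {item: (b - w) / n for item, (b, w, n) in counts.items()}


# Two made-up trials with four words each.
trials = [
    (("joy", "table", "grief", "calm"), "joy", "grief"),
    (("joy", "calm", "table", "war"), "joy", "war"),
]
pairs = implied_pairs(*trials[0])
scores = best_minus_worst_scores(trials)
```

In this sketch, one trial over four items yields five of the six possible pairwise orderings; only the relation between the two unchosen items remains unknown. The scores fall in the range -1 to 1, with items that were never chosen as best or worst receiving 0.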
