CurmElo: The theory and practice of a forced-choice approach to producing preference rankings

We introduce CurmElo, a forced-choice approach to producing a preference ranking of an arbitrary set of objects. CurmElo combines the Elo algorithm with novel techniques for detecting and correcting for (1) polarization among raters induced by preference heterogeneity, and (2) intransitivity in preference rankings. We detail the application of CurmElo to the problem of generating approximately preference-neutral identifiers, in this case four-letter and five-letter nonsense words patterned on the phonological conventions of the English language, using a population of Amazon Mechanical Turk workers as raters. We find evidence that human raters have significant non-uniform preferences over these nonsense words, and we detail the consequences of this finding for social science work that uses identifiers without accounting for the bias they can induce. In addition, we describe how CurmElo can be used to rank a set of objects along arbitrary features or dimensions of preference, relative to a population of raters.
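To make the Elo component concrete, the following is a minimal sketch of the textbook Elo update applied to forced-choice comparisons. It is not CurmElo itself: the polarization and intransitivity corrections are not shown, and the K-factor of 32, the scale of 400, the starting rating of 1500, the function names, and the example words are all assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_score(r_a, r_b, scale=400.0):
    """Probability that an item rated r_a is preferred over one rated r_b
    under the standard logistic Elo model (scale=400 is an assumption)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_rank(comparisons, k=32.0, initial=1500.0):
    """Rank items from a sequence of (preferred, rejected) forced choices.

    k and initial are illustrative defaults, not values from the paper.
    Returns a list of (item, rating) pairs, highest rating first.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        e_w = expected_score(ratings[winner], ratings[loser])
        # The preferred item gains what the rejected item loses,
        # so the sum of all ratings stays constant.
        delta = k * (1.0 - e_w)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical nonsense words and choices, for illustration only.
choices = [("blin", "trof"), ("blin", "sked"), ("trof", "sked")]
ranking = elo_rank(choices)
for word, rating in ranking:
    print(f"{word}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the comparison, the result of plain Elo is sensitive to the order in which choices are processed; shuffling the comparisons and averaging over several passes is one common way to reduce that sensitivity.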
