Four Years in Review: Statistical Practices of Likert Scales in Human-Robot Interaction Studies

As robots become more prevalent, the field of human-robot interaction (HRI) grows in importance, and its researchers should endeavor to employ the best statistical practices. Likert scales are commonly used in HRI to measure perceptions and attitudes. Due to misinformation or honest mistakes, most HRI researchers do not adopt best practices when analyzing Likert data. We first review the psychometric literature to determine the current standard for Likert scale design and analysis. Next, we survey four years of the International Conference on Human-Robot Interaction (2016 through 2019) and report on incorrect statistical practices and scale design. During these years, only 3 of the 110 papers applied proper statistical testing to correctly designed Likert scales. Our analysis suggests there are areas for meaningful improvement in the design and testing of Likert scales. Lastly, we provide recommendations to improve the accuracy of conclusions drawn from Likert data.
