The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations

Rating and Likert scales are widely used in evaluation experiments to measure the quality of Natural Language Generation (NLG) systems. We review the use of rating and Likert scales in NLG evaluation tasks reported in papers published at NLG-specialized conferences over the last ten years (135 papers in total). Our analysis brings to light a number of deviations from good practice in their use. We conclude with some recommendations about the use of such scales. Our aim is to encourage the appropriate use of evaluation methodologies in the NLG community.
