Grades are not Normal: Improving Exam Score Models Using the Logit-Normal Distribution

Understanding exam score distributions has implications for item response theory (IRT), grade curving, and downstream modeling tasks such as peer grading. Historically, grades have been assumed to be normally distributed, and to this day the normal is the ubiquitous choice for modeling exam scores. While this is a good assumption for tests composed of equally weighted dichotomous items, it breaks down on the highly polytomous domain of undergraduate-level exams. The logit-normal is a natural alternative because it has a bounded range, can represent asymmetric distributions, and lines up with IRT models that perform logistic transformations on normally distributed abilities. To compare the two distributions, we analyze an anonymized dataset from Gradescope consisting of over 4000 highly polytomous undergraduate exams. We show that the logit-normal models this data better than the normal while using the same number of parameters. In addition, we propose a new continuous polytomous IRT model that reduces the number of item parameters by using a logit-normal assumption at the item level.
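As a rough illustration of the comparison described above, the sketch below fits both a normal and a logit-normal to the same scores by maximum likelihood and compares their log-likelihoods. It is not the authors' code: the data are synthetic (not the Gradescope dataset), the function names are hypothetical, and scores are assumed to be rescaled to the open interval (0, 1), so exact scores of 0 or 1 would need separate handling. Both models have two parameters (a location and a scale), so the log-likelihoods can be compared directly.

```python
import math
import random
import statistics


def logit(p):
    """Map a score in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def normal_loglik(scores):
    """Log-likelihood of the scores under a normal fitted by maximum likelihood."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores, mu)  # MLE uses the population std
    return sum(gaussian_logpdf(x, mu, sigma) for x in scores)


def logit_normal_loglik(scores):
    """Log-likelihood of the scores under a logit-normal fitted by maximum likelihood.

    If logit(X) is normal, the density of X picks up the Jacobian term
    1 / (x * (1 - x)) from the change of variables.
    """
    zs = [logit(x) for x in scores]
    mu = statistics.fmean(zs)
    sigma = statistics.pstdev(zs, mu)
    return sum(
        gaussian_logpdf(z, mu, sigma) - math.log(x * (1.0 - x))
        for x, z in zip(scores, zs)
    )


def simulate_scores(n, mu=1.2, sigma=0.8, seed=0):
    """Synthetic scores (assumption): logistic transform of normal 'abilities'."""
    rng = random.Random(seed)
    return [1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma))) for _ in range(n)]


if __name__ == "__main__":
    scores = simulate_scores(2000)
    print("normal       log-likelihood:", normal_loglik(scores))
    print("logit-normal log-likelihood:", logit_normal_loglik(scores))
```

Because the synthetic scores here are drawn from a logit-normal, this example only demonstrates the mechanics of the comparison; it says nothing about which model fits real exam data better.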
