How do professors format exams?: an analysis of question variety at scale

This study analyzes the use of paper exams in college-level STEM courses. It leverages a unique dataset of nearly 1,800 exams, which were scanned into a web application and then processed by a team of annotators to yield a detailed snapshot of how instructors currently structure exams. The investigation focuses on the variety of question formats and how they are applied across different course topics. The analysis divides questions into seven top-level categories and finds significant differences among them in positioning, use across subjects, and student performance. It also reveals a strong tendency within the collection for instructors to order questions from easier to harder. A linear mixed-effects model is used to estimate the reliability of different question types. Long writing questions stand out for their high reliability, while binary and multiple-choice questions have low reliability. The model suggests that more than three multiple-choice questions, or more than five binary questions, are required to attain the same reliability as a single long writing question. A correlation analysis across seven response types finds that correlations between student abilities on different question types exceed 70 percent for all pairs, although binary and multiple-choice questions stand out for having unusually low correlations with all other question types.
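The abstract does not report the underlying reliability estimates, but the item-equivalence comparison (roughly three multiple-choice or five binary questions per long writing question) can be illustrated with the Spearman-Brown prophecy formula, a standard way of relating test length to reliability. The sketch below is illustrative only: the per-item reliabilities (r_long, r_mc, r_binary) are assumed placeholder values chosen to produce numbers of the same order as those reported, not figures from the study, whose estimates come from a linear mixed-effects model rather than this closed-form calculation.

    # Illustrative sketch: how many items of one type would be needed to match
    # the reliability of a single long writing question, via the Spearman-Brown
    # prophecy formula. The per-item reliabilities below are ASSUMED placeholder
    # values, not estimates taken from the study.

    def spearman_brown(r_single: float, n: float) -> float:
        """Reliability of a test of n parallel items, each with reliability r_single."""
        return n * r_single / (1 + (n - 1) * r_single)

    def items_needed(r_item: float, r_target: float) -> float:
        """Number of items with per-item reliability r_item needed to reach r_target."""
        return r_target * (1 - r_item) / (r_item * (1 - r_target))

    r_long = 0.50    # assumed reliability of one long writing question
    r_mc = 0.25      # assumed reliability of one multiple-choice question
    r_binary = 0.16  # assumed reliability of one binary (true/false) question

    print(items_needed(r_mc, r_long))      # ~3.0 multiple-choice items
    print(items_needed(r_binary, r_long))  # ~5.25 binary items

The calculation only conveys the intuition that question types with low per-item reliability must be compensated by additional items; the study itself estimates reliability from student responses with a mixed-effects model.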
