The Impact of Design Decisions on Measurement Accuracy Demonstrated Using the Hierarchical Rater Model

When humans assign ratings in testing contexts, concern arises about whether rater effects reduce the accuracy of the resulting measures. Those who lead scoring efforts implement several activities and use various rating designs to minimize the impact of rater errors. This article uses the Hierarchical Rater Model (HRM) to demonstrate how the magnitude of rater errors and the number of ratings associated with various measurement facets (e.g., raters and items) affect the accuracy of measures. Additionally, we demonstrate how the level at which decisions are made about the measures (e.g., test taker item scores, test taker total scores, test taker classifications) affects measurement accuracy.
