Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

This study investigated model-data fit and differential rater functioning in large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we examined whether the severity of expert raters (N = 24) was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed both for the raters as a group and for some individual raters, based on their expected locations on the logit scale. This indicates that expert raters did not demonstrate invariant levels of severity when rating subgroups of ensembles across the four school levels. Of the 92 potential pairwise interactions examined, 14 (15.2%) were statistically significant, indicating that 10 individual raters demonstrated differential severity for at least one school level. Examining the individual pairwise interactions revealed meaningful systematic patterns for some raters. Implications for improving fairness and equity in large group music performance evaluations are discussed.
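
For readers unfamiliar with the model family, the following is a generic sketch of how a many-facet Rasch partial credit model with a rater-by-school-level interaction term is commonly written. It is not taken from the study itself: the choice of facets, the symbols, and the form of the interaction term are all assumptions made for illustration, and the specification actually estimated by the authors may differ.

```latex
% Illustrative only: facets and symbols are assumed, not quoted from the study.
\[
\ln\!\left(\frac{P_{nijmk}}{P_{nijm(k-1)}}\right)
  = \theta_n - \lambda_i - \delta_j - \gamma_m - \tau_{jk}
\]
% Assumed interaction (differential rater functioning) model:
\[
\ln\!\left(\frac{P_{nijmk}}{P_{nijm(k-1)}}\right)
  = \theta_n - \lambda_i - \delta_j - \gamma_m - \phi_{im} - \tau_{jk}
\]
% P_{nijmk}  : probability that ensemble n, rated by rater i on item j
%              at school level m, receives category k
% \theta_n   : achievement of ensemble n (logits)
% \lambda_i  : severity of rater i
% \delta_j   : difficulty of item j
% \gamma_m   : location of school level m
% \phi_{im}  : rater-by-school-level interaction term
% \tau_{jk}  : threshold between categories k-1 and k, specific to item j
%              (the partial credit formulation)
```

Under this sketch, invariant rater severity would correspond to every interaction term \phi_{im} being statistically indistinguishable from zero. The significant pairwise interactions reported above are the kind of evidence that would contradict that condition.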


[37]  Jane W. Davidson,et al.  The Role of the Body in the Production and Perception of Solo Vocal Performance: A Case Study of Annie Lennox , 2001 .

[38]  Investigating Performance Evaluation by Assessors of Singers in a Music College Setting , 2001 .

[39]  B. Wright,et al.  Construction of measures from many-facet data. , 2002, Journal of applied measurement.

[40]  John M Linacre,et al.  Optimizing rating scale category effectiveness. , 2002, Journal of applied measurement.

[41]  Kimi Kondo-Brown,et al.  A FACETS analysis of rater bias in measuring Japanese second language writing performance , 2002 .

[42]  Ron Brooker,et al.  Examiner Perceptions of Using Criteria in Music Performance Assessment , 2002 .

[43]  T. Lumley Assessment criteria in a large-scale writing test: what do they really mean to the raters? , 2002 .

[44]  George Engelhard,et al.  MONITORING FACULTY CONSULTANT PERFORMANCE IN THE ADVANCED PLACEMENT ENGLISH LITERATURE AND COMPOSITION PROGRAM WITH A MANY-FACETED RASCH MODEL , 2003 .

[45]  Aaron Williamon,et al.  Evaluating Evaluation: Musical Performance Assessment as a Research Tool , 2003 .

[46]  Edward W Wolfe,et al.  Detecting and measuring rater effects using many-facet Rasch measurement: part I. , 2003, Journal of applied measurement.

[47]  Martin J. Bergee Faculty Interjudge Reliability of Music Performance Evaluation , 2003 .

[48]  Richard R. Sudweeks,et al.  A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing , 2004 .

[49]  Edward W Wolfe,et al.  Detecting and measuring rater effects using many-facet Rasch measurement: Part II. , 2004, Journal of applied measurement.

[50]  M. R. Espejo Applying the Rasch Model: Fundamental Measurement in the Human Sciences , 2004 .

[51]  Mark Wilson,et al.  Constructing Measures: An Item Response Modeling Approach , 2004 .

[52]  Thomas P. Rohrer The science and psychology of music performance: creative strategies for teaching and learning , 2004 .

[53]  Everett V. Smith,et al.  Introduction to Rasch measurement : theory, models and applications , 2004 .

[54]  A. Williamon Measuring performance enhancement in music , 2004 .

[55]  Tom Lumley,et al.  Assessing second language writing : the rater's perspective , 2005 .

[56]  Lourinda S. Crochet Repertoire selection practices of band directors as a function of teaching experience, training, instructional level, and degree of success , 2006 .

[57]  Timothy D. Brakel Inter-Judge Reliability of the Indiana State School Music Association High School Instrumental Festival , 2006 .

[58]  Charles E. Norris,et al.  An Examination of the Reliabilities of Two Choral Festival Adjudication Forms , 2007 .

[59]  Robert L. Johnson,et al.  Assessing Performance: Designing, Scoring, and Validating Performance Tasks , 2008 .

[60]  B. Silvey The Effects of Band Labels on Evaluators’ Judgments of Musical Performance , 2009 .

[61]  George Engelhard,et al.  Using Item Response Theory and Model—Data Fit to Conceptualize Differential Item and Person Functioning for Students With Disabilities , 2009 .

[62]  A Study of the Reliability of Adjudicator Ratings at the 2005 Virginia Band and Orchestra Directors Association State Marching Band Festivals , 2009 .

[63]  Ž. Pedišić,et al.  Handbook for Quantitative Methods , 2010 .

[64]  Friedrich Platz,et al.  When the Eye Listens: A Meta-analysis of How Audio-visual Presentation Enhances the Appreciation of Music Performance , 2012 .

[65]  George Engelhard,et al.  Invariant Measurement: Using Rasch Models in the Social, Behavioral, and Health Sciences , 2012 .

[66]  Christian Tarnai,et al.  What Do Music Preferences Reveal About Personality? A Cross-Cultural Replication Using Self-Ratings and Ratings of Music Samples , 2012 .

[67]  Yi Du,et al.  DIFFERENTIAL FACET FUNCTIONING DETECTION IN DIRECT WRITING ASSESSMENT , 2012 .

[68]  Phillip M. Hash An Analysis of the Ratings and Interrater Reliability of High School Band Contests , 2012 .

[69]  Stefanie A. Wind,et al.  Rating Quality Studies Using Rasch Measurement Theory , 2013 .

[70]  Rating Quality Studies Using Rasch Measurement Theory. Research Report 2013-3. , 2013 .

[71]  N. Srinivasan,et al.  Role of affect in decision making. , 2013, Progress in brain research.

[72]  Brian C. Wesolowski Documenting Student Learning in Music Performance , 2014 .

[73]  Juliane Hahn,et al.  Foundations Of Music Education , 2016 .

[74]  Tobias Bachmeier Architectural Acoustics Blending Sound Sources Sound Fields And Listeners , 2016 .

[75]  Brian C. Wesolowski,et al.  Assessing jazz big band performance: The development, validation, and application of a facet-factorial rating scale , 2016 .