MONITORING FACULTY CONSULTANT PERFORMANCE IN THE ADVANCED PLACEMENT ENGLISH LITERATURE AND COMPOSITION PROGRAM WITH A MANY-FACETED RASCH MODEL

The purpose of this study was to examine, describe, evaluate, and compare the rating behavior of faculty consultants who scored essays written for the Advanced Placement English Literature and Composition (AP® ELC) Exam. Data from the 1999 AP ELC Exam were analyzed using FACETS (Linacre, 1998) and SAS. The faculty consultants were not interchangeable: they differed in the level of severity they exercised. Had students' ratings been adjusted for these severity differences, about 30 percent of students would have received an AP grade different from the one they were awarded, although almost all of the differences were one grade or less. Adjusting ratings for faculty consultant severity differences would not have affected some student subgroups more than others.
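The severity effect described above can be illustrated with a minimal sketch of the rating-scale form of the many-facet Rasch model: the log-odds of a score in category k rather than k−1 is modeled as examinee ability minus rater severity minus a category threshold. The scale, threshold values, and parameters below are illustrative only, not taken from the study.

```python
import math

def mfrm_category_probs(theta, severity, thresholds):
    """Rating-scale many-facet Rasch model: probability of each score
    category 0..m for examinee ability `theta`, rater `severity`, and
    Rasch-Andrich thresholds tau_1..tau_m."""
    # Cumulative logit for category k is the sum over j<=k of
    # (theta - severity - tau_j); category 0 has logit 0.
    logits = [0.0]
    cum = 0.0
    for tau in thresholds:
        cum += theta - severity - tau
        logits.append(cum)
    exps = [math.exp(x) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_score(theta, severity, thresholds):
    """Model-expected rating: sum of category index times its probability."""
    probs = mfrm_category_probs(theta, severity, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical 0-9 essay scale with evenly spaced thresholds.
taus = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
lenient = expected_score(theta=0.5, severity=-0.5, thresholds=taus)
severe = expected_score(theta=0.5, severity=0.5, thresholds=taus)
assert lenient > severe  # a more severe rater yields a lower expected rating
```

For the same examinee, a more severe rater shifts probability mass toward lower categories, which is why unadjusted scores can differ by a grade depending on who rated the essay.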
