In 1993, we reported in the Journal of Educational Measurement that task-sampling variability was the Achilles' heel of science performance assessment: to reduce measurement error, tasks needed to be stratified before sampling, sampled in large numbers, or possibly both. However, Cronbach, Linn, Brennan, and Haertel (1997) pointed out that a task-sampling interpretation of a large person x task (pt) variance component might be incorrect. Because tasks are typically administered on only a single occasion, task and occasion sampling are confounded, and the person x task source of measurement error is confounded with the person x task x occasion (pto) source. If pto variability accounts for a substantial part of the commonly observed pt interaction, stratifying tasks into homogeneous subsets, a cost-effective way of addressing task-sampling variability, might not increase accuracy, because stratification would not address the pto source of error. Another conclusion reported in the JEM article was that only the direct observation (DO) and notebook (NB) methods of collecting performance assessment data were exchangeable; computer simulation, short-answer, and multiple-choice methods were not. If Cronbach et al. were right, however, this exchangeability conclusion might also be incorrect. After re-examining and re-analyzing the data, we found support for Cronbach et al.: the large task-sampling variability was due to both the person x task interaction and the person x task x occasion interaction. Moreover, we found that the direct observation, notebook, and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.
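For readers unfamiliar with generalizability theory, the confounding argument can be stated as a minimal sketch in standard G-theory notation (the fully crossed person x task x occasion model is the usual textbook formulation, not equations quoted from the paper). In the crossed p x t x o design, observed-score variance decomposes into main-effect and interaction components:

\[
\sigma^2(X_{pto}) = \sigma^2_{p} + \sigma^2_{t} + \sigma^2_{o} + \sigma^2_{pt} + \sigma^2_{po} + \sigma^2_{to} + \sigma^2_{pto,e}
\]

When each task is given on only one occasion, the occasion facet is hidden, so the estimated task interaction absorbs the triple-interaction residual:

\[
\hat{\sigma}^2_{pt} \approx \sigma^2_{pt} + \sigma^2_{pto,e}
\]

Under such a single-occasion design the two components cannot be separated; only a design in which tasks are administered on multiple occasions can disentangle them, which is why stratifying tasks alone may not reduce the observed error.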
[1] Maridyth M. McBee, et al. The Generalizability of a Performance Assessment Measuring Achievement in Eighth-Grade Mathematics, 1998.
[2] Stephen P. Klein, et al. The Cost of Science Performance Assessments in Large-Scale Testing Programs, 1997.
[3] Xiaohong Gao, et al. Generalizability of Large-Scale Performance Assessments in Science: Promises and Problems, 1994.
[4] Edward H. Haertel, et al. Generalizability Analysis for Performance Assessments of Student Achievement or School Effectiveness, 1997.
[5] Gail P. Baxter, et al. Science Performance Assessments: Benchmarks and Surrogates, 1994.
[6] R. Shavelson, et al. Sampling Variability of Performance Assessments, 1993.
[7] Guillermo Solano-Flores, et al. On the Development and Evaluation of a Shell for Generating Science Performance Assessments, 1999.
[8] R. Shavelson. Performance Assessment in Science, 1991.
[9] R. Shavelson, et al. Rhetoric and Reality in Science Performance Assessments: An Update, 1996.
[10] Maria Araceli Ruiz-Primo, et al. On the Stability of Performance Assessments, 1993.