The Role of Reliability in Criterion-Referenced Tests

In discussions of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, suppose the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe score on a domain of items are compared with the errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons. The test-based procedure is then found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.

The assumption that reliability indices based on internal consistency are not particularly relevant to criterion-referenced testing can be traced, at least in part, to a seminal article by Popham and Husek (1969). They argued that since criterion-referenced tests are designed to determine a person's achievement compared to some criterion, the meaning of the score should not depend on the scores of other people. Therefore, Popham and Husek concluded that "variability is not a necessary condition for a good criterion-referenced test" (p. 3) and that reliability indices based on score variability "are not only irrelevant to criterion-referenced uses, but are actually injurious to their proper development and use" (p. 4).

The analyses presented in this paper suggest that reliability is an important issue in criterion-referenced testing. In particular, these analyses suggest that if a criterion-referenced test had a reliability (defined in terms of internal consistency) below 0.5, a simple a priori procedure would provide better estimates of students' universe scores than would individual observed scores.
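As a rough illustration of why 0.5 acts as the threshold, the following sketch simulates the two estimation procedures and compares their mean squared errors. It is not the analysis developed in this paper. It assumes a simple classical model in which each observed score equals the universe score plus an independent, normally distributed error, and it does not bound scores to the interval from 0 to 1. The function name compare_estimators, the mean of 0.7, and the total standard deviation of 0.15 are arbitrary choices made only for illustration.

```python
import random
import statistics


def compare_estimators(reliability, n_persons=20000, total_sd=0.15, seed=0):
    """Return (mse_observed, mse_group_mean) for one simulated test.

    Assumed model (illustrative only): observed = universe + error, with
    reliability = universe-score variance / observed-score variance.
    """
    rng = random.Random(seed)
    total_var = total_sd ** 2
    universe_sd = (reliability * total_var) ** 0.5
    error_sd = ((1.0 - reliability) * total_var) ** 0.5

    # Simulated universe scores (proportion of the domain answered correctly).
    universe = [rng.gauss(0.7, universe_sd) for _ in range(n_persons)]
    # Observed test scores: universe score plus independent measurement error.
    observed = [u + rng.gauss(0.0, error_sd) for u in universe]

    # A priori procedure: assign the mean observed score to every person.
    group_mean = statistics.fmean(observed)

    # Mean squared error of each procedure as an estimate of universe scores.
    mse_observed = statistics.fmean(
        [(x - u) ** 2 for x, u in zip(observed, universe)]
    )
    mse_group_mean = statistics.fmean(
        [(group_mean - u) ** 2 for u in universe]
    )
    return mse_observed, mse_group_mean


if __name__ == "__main__":
    for rel in (0.3, 0.5, 0.7):
        mse_obs, mse_mean = compare_estimators(rel)
        print(f"reliability={rel:.1f}  "
              f"observed-score MSE={mse_obs:.5f}  "
              f"group-mean MSE={mse_mean:.5f}")
```

Under these assumptions, the error of the observed score approximates the error variance, while the error of the group mean approximates the universe score variance. The two are equal when reliability is 0.5, so the observed score is the better estimate only above that value.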
These analyses are not intended to demonstrate that criterion-referenced tests should be evaluated exclusively, or even principally, by internal consistency reliability coefficients, but they do suggest that such coefficients relate to the usefulness of the tests in estimating universe scores.

Errors of Measurement

Assume that we have a domain of items and a group of people, and that we draw a random sample of items from the domain and administer the resulting test to the group. Individuals' observed scores can then be used to estimate their universe score, the proportion of items in the domain that they could answer correctly if they responded to all the items. Using generalizability theory, we can represent the