If criterion-referenced testing is to achieve its full potential in situations as far ranging as teacher classroom assessments and professional certification examinations, criterion-referenced test scores must lead to decisions that are "consistent" across parallel-form (or retest) administrations of a test (Hambleton, Swaminathan, Algina, & Coulson, 1978). That is, a high percentage of examinees must be classified in the same mastery category or state by a parallel form (or a readministration) of the test, or the resulting decisions will be of limited usefulness. Unfortunately, there are few methods or guidelines available to assist test developers in determining the number of items required to achieve a desired level of decision consistency. Many available methods for determining the lengths of criterion-referenced tests are based on unreasonable assumptions, are highly conservative, or fail to consider important factors (see Wilcox, 1980, for a review of methods). For example, the well-known generalized Spearman-Brown formula can be used for determining the appropriate length of a norm-referenced test, but the formula is of limited value in the construction of criterion-referenced tests. The Spearman-Brown formula incorporates the correlation between scores on parallel forms of a test, whereas with criterion-referenced tests, interest is often centered on the consistency of decision-making across parallel-form (retest) administrations of a test. The "correlation of scores" and the "consistency of decisions resulting from the use of scores" will, in general, have different values. In fact, even for a given set of test items and group of examinee responses, the index of decision consistency will vary with the chosen cut-off score. Consider, for example, three popular methods for determining the lengths of criterion-referenced tests.
The first two methods (Millman, 1973; Wilcox, 1976) focus on the individual examinee and determine test lengths to ensure minimum probabilities of correct classification. The third method (Eignor and Hambleton, 1979) focuses on a group of examinees and determines test lengths to ensure some overall level of decision consistency and accuracy. Millman's method requires the use of the simple binomial test model and an approximate true score for each examinee who will be tested. However, because the purpose of testing is often to assess examinee true scores, suitable true score estimates will frequently not be available. Also, unless a test is to be administered at a computer terminal, it is usually not practical to permit test lengths to vary from one examinee to the next. Wilcox's method, unlike Millman's, does result in the determination of a single test length for the total group of examinees. His method has considerable intuitive appeal, but it also leads to the selection of highly conservative test lengths. That is, for many of the examinees in the group of interest, considerably longer tests are used than are needed to achieve the desired probabilities of correct classification.
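The individual-examinee reasoning behind a binomial-model approach can be illustrated with a short sketch. It is not a reproduction of Millman's published procedure. It assumes that an examinee's number-correct score on an n-item test follows a binomial distribution governed by the examinee's true (domain) score, that the examinee is classified a master when the observed proportion correct reaches the cut-off, and that the same cut-off separates true masters from true non-masters. The function names and the `max_items` search limit are hypothetical and chosen only for illustration.

```python
from math import ceil, comb


def prob_correct_classification(n_items, true_score, cutoff):
    """Probability of a correct mastery decision for one examinee (binomial model)."""
    # Minimum number-correct score needed to be classified a master.
    # round() guards against floating-point noise in cutoff * n_items.
    passing_score = ceil(round(cutoff * n_items, 9))
    # P(X >= passing_score), where X ~ Binomial(n_items, true_score).
    p_pass = sum(
        comb(n_items, x) * true_score**x * (1 - true_score) ** (n_items - x)
        for x in range(passing_score, n_items + 1)
    )
    # Assumption: a true master is classified correctly by passing,
    # a true non-master by failing.
    return p_pass if true_score >= cutoff else 1 - p_pass


def shortest_adequate_length(true_score, cutoff, target, max_items=200):
    """Smallest test length whose probability of correct classification meets target."""
    for n_items in range(1, max_items + 1):
        if prob_correct_classification(n_items, true_score, cutoff) >= target:
            return n_items
    return None  # no length up to max_items reaches the target


if __name__ == "__main__":
    # Hypothetical examinee: true score 0.90, cut-off 0.80, target probability 0.95.
    print(shortest_adequate_length(true_score=0.90, cutoff=0.80, target=0.95))
```

Because the passing score must be a whole number of items, the probability of correct classification does not increase smoothly with test length under these assumptions, so the shortest adequate length is best read as an illustration rather than a recommendation. The sketch also makes the practical difficulty visible: the result depends on a true score value that is usually unknown before testing.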
[1] Jason Millman, et al. Passing Scores and Test Lengths for Domain-Referenced Measures, 1973.
[2] R. Wilcox. Determining the Length of a Criterion-Referenced Test, 1980.
[3] Rand R. Wilcox, et al. A Note on the Length and Passing Score of a Mastery Test, 1976.
[4] R. Hambleton, et al. Effects of Test Length and Advancement Score on Several Criterion-Referenced Test Reliability and Validity Indices. Laboratory of Psychometric and Evaluation Research Report No. 86, 1979.
[5] R. Traub, et al. Reliability of Test Scores and Decisions, 1980.
[6] B. Wright, et al. Best Test Design, 1979.
[7] F. Lord. Applications of Item Response Theory to Practical Testing Problems, 1980.
[8] Ronald K. Hambleton, et al. Criterion-Referenced Testing and Measurement: A Review of Technical Issues and Developments, 1978.