Item response models are finding increasing use in achievement and aptitude test development. Item response theory (IRT) test development involves the selection of test items based on a consideration of their item information functions. A problem arises, however, because item information functions are computed from item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance due to errors in the item parameter estimates. The resulting test, therefore, falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development.

Over the last 20 years, many test developers have used item response theory (IRT) models and methods rather than classical measurement models in their test development and related technical work (Hambleton, 1989; Hambleton & Swaminathan, 1985; Lord, 1980). Item response theory, particularly as reflected in the one-, two-, and three-parameter logistic models for dichotomously scored items, is receiving increasing attention from test developers in test design and test item selection, in addressing item bias, in computer-adaptive testing, and in the equating and reporting of test scores. Many major test publishers, state departments of education, and large school districts currently use IRT models in some capacity in their testing work.

One problem that arises when applying IRT models in test development involves capitalizing on chance due to positive errors in some item parameter estimates. As a result, tests may often fall short, statistically, of what is expected, and standard errors associated with ability estimates may be correspondingly underestimated (if the inflated item parameter estimates are used), which leads to overconfidence in the ability estimates. The same problem arises in item selection in computerized adaptive testing.

Author note: The research described in this article was funded by the Graduate Management Admission Council (GMAC). The GMAC encourages researchers to formulate and freely express their own opinions, and the opinions expressed here are not necessarily those of the GMAC. This article was presented at the meeting of the American Psychological Association, Boston, 1990. The authors benefited considerably from the suggestions of two anonymous reviewers and the editor.
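The capitalization-on-chance effect described above can be made concrete with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes a 3PL item pool with invented parameter distributions and error magnitudes, adds Gaussian estimation error to the "true" parameters, selects the items with the largest estimated information at a target ability level, and compares the test information the developer expects with the information the selected items actually deliver.

```python
# Minimal sketch of capitalization on chance in IRT item selection.
# All pool sizes, parameter distributions, and error SDs are assumptions
# chosen for illustration, not values taken from the article.
import numpy as np

rng = np.random.default_rng(0)
D = 1.7  # scaling constant commonly used with logistic IRT models

def p3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_info(theta, a, b, c):
    """3PL item information function (Lord, 1980)."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * ((p - c) ** 2 / (1.0 - c) ** 2) * (q / p)

# "True" parameters for a 400-item pool (assumed distributions).
n_pool, n_test, theta0 = 400, 40, 0.0
a_true = rng.lognormal(mean=0.0, sigma=0.3, size=n_pool)
b_true = rng.normal(0.0, 1.0, size=n_pool)
c_true = np.full(n_pool, 0.2)

# Parameter estimates = true values plus estimation error
# (error SDs stand in for calibration-sample noise).
a_est = np.maximum(a_true + rng.normal(0.0, 0.15, size=n_pool), 0.05)
b_est = b_true + rng.normal(0.0, 0.20, size=n_pool)
c_est = np.clip(c_true + rng.normal(0.0, 0.05, size=n_pool), 0.0, 0.5)

# Select the "best" items using the ESTIMATED information at theta0.
info_est = item_info(theta0, a_est, b_est, c_est)
chosen = np.argsort(info_est)[-n_test:]

expected = info_est[chosen].sum()  # what the developer thinks was built
actual = item_info(theta0, a_true[chosen], b_true[chosen],
                   c_true[chosen]).sum()  # what was actually built

print(f"Expected test information at theta = 0: {expected:.1f}")
print(f"Actual test information at theta = 0:   {actual:.1f}")
# Typically expected > actual: items with positively biased estimates are
# over-selected, so the assembled test falls short of the target.
```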
[1] R. L. Thorndike. Personnel selection: Test and measurement techniques, 1951.
[2] Melvin R. Novick, et al. Some latent trait models and their use in inferring an examinee's ability, 1966.
[3] F. Lord. Applications of Item Response Theory to Practical Testing Problems, 1980.
[4] Fritz Drasgow, et al. Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study, 1982.
[5] R. Hambleton, et al. Item Response Theory: Principles and Applications, 1984.
[6] R. Hambleton, et al. Item response theory. In The History of Educational Measurement, 1984.
[7] Experiences in the Application of Item Response Theory in Test Construction, 1989.
[8] R. Hambleton. Principles and selected applications of item response theory, 1989.