Influence of Item Parameter Estimation Errors in Test Development

Item response models are finding increasing use in achievement and aptitude test development. Test development based on item response theory (IRT) involves selecting test items on the basis of their item information functions. A problem arises, however, because item information functions are computed from item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance because of errors in the item parameter estimates, and the resulting test therefore falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development.

Over the last 20 years, many test developers have used item response theory (IRT) models and methods rather than classical measurement models in their test development and related technical work (Hambleton, 1989; Hambleton & Swaminathan, 1985; Lord, 1980). Item response theory, particularly as reflected in the one-, two-, and three-parameter logistic models for dichotomously scored items, is receiving increasing attention from test developers in test design and test item selection, in addressing item bias, in computer-adaptive testing, and in the equating and reporting of test scores. Many major test publishers, state departments of education, and large school districts currently use IRT models in some capacity in their testing work.

One problem that arises when applying IRT models in test development is capitalization on chance due to positive errors in some item parameter estimates. As a result, tests may often fall short, statistically, of what is expected, and the standard errors associated with ability estimates may be correspondingly underestimated (if the inflated item parameter estimates are used), leading to overconfidence in the ability estimates. The same problem arises in item selection in computerized adaptive testing. Perhaps it should be

Author Note: The research described in this article was funded by the Graduate Management Admission Council (GMAC). The GMAC encourages researchers to formulate and freely express their own opinions, and the opinions expressed here are not necessarily those of the GMAC. This article was presented at the meeting of the American Psychological Association, Boston, 1990. The authors benefited considerably from the suggestions of two anonymous reviewers and the editor.
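The capitalization-on-chance effect summarized in the abstract can be illustrated with a small numerical sketch. The Python code below is not the simulation design used in the article; it is a minimal illustration under assumed conditions (a hypothetical pool of 400 items, illustrative error standard deviations, and the standard three-parameter logistic item information function). Items are ranked and selected by their information computed from error-laden parameter estimates, and the expected test information is then compared with the information the selected items actually provide according to their true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1.7  # scaling constant commonly used with logistic IRT models


def info_3pl(theta, a, b, c):
    """Item information for the three-parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item pool: true parameters for 400 items (illustrative values).
n_pool, n_test, theta0 = 400, 40, 0.0
a_true = rng.lognormal(mean=0.0, sigma=0.3, size=n_pool)
b_true = rng.normal(0.0, 1.0, size=n_pool)
c_true = rng.uniform(0.10, 0.25, size=n_pool)

# Estimated parameters = true parameters + estimation error.
# Error magnitudes here are assumptions for illustration, not those in the article.
a_est = np.clip(a_true + rng.normal(0.0, 0.15, size=n_pool), 0.2, None)
b_est = b_true + rng.normal(0.0, 0.20, size=n_pool)
c_est = np.clip(c_true + rng.normal(0.0, 0.05, size=n_pool), 0.0, 0.5)

# Select the "best" items at theta0 using the *estimated* information.
est_info = info_3pl(theta0, a_est, b_est, c_est)
selected = np.argsort(est_info)[-n_test:]

# Expected test information (from estimates) vs. the information the
# selected items actually deliver (from their true parameters).
expected = est_info[selected].sum()
actual = info_3pl(theta0, a_true[selected], b_true[selected], c_true[selected]).sum()
print(f"Expected test information at theta = 0: {expected:.2f}")
print(f"Actual test information at theta = 0:   {actual:.2f}")
# Items whose information is overstated by positive estimation errors are
# favored in selection, so the expected value typically exceeds the actual one.
```

On typical runs of a sketch like this, the expected test information exceeds the information actually delivered, mirroring the shortfall described above; the size of the gap depends on the assumed error variances, the pool size, and the number of items selected.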