A Didactic Explanation of Item Bias, Item Impact, and Item Validity from a Multidimensional Perspective

Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, in which items that measure multiple abilities are scored as though they measure a single ability. If two groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of ability on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight into the difference between item bias and item impact and how both relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's (1991) Simultaneous Item Bias (SIB) strategies, will be used to illustrate how practitioners can detect item bias.

The purpose of most standardized tests is to distinguish between ability levels of examinees and thereby rank order individuals. Ranking examinees accurately requires that all of the items in a test discriminate between levels of the same purported ability. Problems are encountered when a test contains items that discriminate between levels of several different abilities, or of several different composites of abilities. Unfortunately, because ordering is a unidimensional concept, researchers cannot order examinees on two or more abilities at the same time unless they base the ranking on, for example, a weighted sum of the skills being measured. Specifically, if a test is multidimensional, there is no unique one-to-one mapping between an examinee's estimated unidimensional ability and that examinee's underlying composite of abilities.

To study the relationship between an examinee's latent ability and the probability of a correct response, researchers can use probabilistic item response theory (IRT) models that describe the interaction between an examinee's level of ability and an item's difficulty and discrimination parameters (a unidimensional and a two-dimensional form of such models are sketched at the end of this section). In most cases, practitioners use unidimensional IRT models, even though an examinee's score may reflect a composite of abilities. Problems related to this model misspecification have plagued psychometricians for years, especially in attempts to model cognitive processes (cf. Traub, 1983). It is important that practitioners who use IRT models realize that items and examinees interact and that this interaction needs to be closely examined. The interaction between a group of examinees and items on a test may be unidimen
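As a concrete illustration of the distinction drawn above, consider a minimal sketch, in LaTeX notation, of the two kinds of models. The particular parameterizations shown (a two-parameter logistic model and a compensatory two-dimensional MIRT model) are standard forms offered here only as an assumed example, not as the specific models used later in this article. In the unidimensional case, the probability that an examinee with ability \(\theta\) answers item \(i\) correctly is

\[
P_i(\theta) = \frac{1}{1 + \exp\{-1.7\, a_i(\theta - b_i)\}},
\]

where \(a_i\) and \(b_i\) are the item's discrimination and difficulty parameters. In a two-dimensional compensatory MIRT model, the examinee's ability is a vector \((\theta_1, \theta_2)\), and the item discriminates on each dimension separately:

\[
P_i(\theta_1, \theta_2) = \frac{1}{1 + \exp\{-(a_{i1}\theta_1 + a_{i2}\theta_2 + d_i)\}},
\]

where \(a_{i1}\) and \(a_{i2}\) are the discrimination parameters for the two dimensions and \(d_i\) is an intercept (difficulty) term. Scoring such an item unidimensionally amounts to collapsing \((\theta_1, \theta_2)\) onto a single weighted composite, say \(\theta_c = w_1\theta_1 + w_2\theta_2\); examinees with quite different ability vectors can share the same composite \(\theta_c\), which is precisely the absence of a one-to-one mapping noted above.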