Comparison of Empirical and Judgmental Procedures for Detecting Differential Item Functioning.

The purpose of this study was to improve both statistical and judgmental methods for detecting potentially biased items in a test in an attempt to examine the agreement between the results obtained with these methods. If greater agreement between methods can be achieved, test items can be more effectively screened using judgmental methods prior to field testing or actual test administrations. Steps were taken to address several methodological shortcomings of current empirical and judgmental methods. The test data came from samples of 2,000 Native American and 2,000 Anglo-American students who took a 150-item Statewide Proficiency Test. Fifteen Native American educators provided item bias reviews. The results suggest that a somewhat higher level of agreement between methods was obtained than has been observed in other studies. The use of cross-validation in empirically identifying potentially biased items was one reason for the higher level of agreement. However, the judgmental method implemented in this study appeared to have several shortcomings. Practical implications of the findings are presented.

Comparison of Empirical and Judgmental Methods for Detecting Differential Item Functioning¹,²

Ronald K. Hambleton and Russell W.
Jones³
University of Massachusetts at Amherst

¹Laboratory of Psychometric and Evaluative Research Report No. 231. Amherst, MA: University of Massachusetts at Amherst, School of Education.
²Paper presented at the meeting of NCME, San Francisco, 1992.
³The authors wish to thank the 15 Native American educators who completed the item review task, and John Martois for his assistance in locating reviewers.

Paper-and-pencil tests are widely used as tools for selection, promotion, certification, and licensure decisions throughout education, business, the armed services, and industry. As test use for important decisions has increased, the issue of item bias has achieved considerable significance. Test developers must now demonstrate that their tests are free of item bias. To this end, various judgmental and empirical methods for detecting potentially biased items have been proposed (see, for example, Berk, 1982; Hills, 1989; Scheuneman & Bleistein, 1989). These "DIF" studies, as they are commonly called, are designed to detect differential item functioning (DIF) between reference and focal groups. Typically, judgmental and empirical methods for detecting differentially functioning items have shown little agreement (Plake, 1980; Engelhard, Hansche, & Rutledge, 1990). A partial explanation for this low agreement is that the judgmental review forms are sometimes focused on cultural and sexual stereotyping in items rather than on factors which may lead to differential performance between subgroups of interest (Scheuneman, 1982). As a result, many undesirable items are identified in the item bias review process, such as those which may show members of minority groups doing