Toward Increasing Fairness in Score Scale Calibrations Employed in International Large-Scale Assessments

In this article, we investigate the creation of comparable score scales across countries in international large-scale assessments and examine potential improvements to the score scale calibration procedures currently in use. Current procedures often ignore item misfit, so our approach seeks both to improve fairness in scoring and to obtain better model-data fit when calibrating international score scales. To this end, we examine two alternative calibration procedures: (a) a language-based score scale and (b) a more parsimonious international scale in which international parameters are retained for most items and country-specific parameters are estimated only for items that misfit on the international scale. Our analyses used data from all 40 countries participating in the Progress in International Reading Literacy Study (PIRLS). The findings revealed that current calibration procedures yield large proportions of misfitting items (above 25% for some countries). The proposed approach reduced the effect of item misfit on score scale calibrations and also yielded improved model-data fit. These results strengthen confidence in measurements obtained from international large-scale assessments.
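To make the partial-invariance idea concrete, the sketch below illustrates its general logic under a simple Rasch model: each item's fit is checked per country against the international item parameters, and only flagged items receive country-specific difficulty estimates while all other items keep the international values. This is a minimal illustration, not the article's operational procedure; the function names, the grouped RMSD fit statistic, the treatment of abilities as fixed point estimates, and the 0.10 flagging threshold are all simplifying assumptions introduced here.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_rmsd(x, thetas, b, n_groups=10):
    """Grouped RMSD item-fit statistic: weighted root-mean-square gap
    between observed proportions correct and the model-implied item
    characteristic curve across ability groups (a simplified analogue
    of operational RMSD-type fit measures)."""
    edges = np.quantile(thetas, np.linspace(0.0, 1.0, n_groups + 1)[1:-1])
    groups = np.digitize(thetas, edges)
    sq = 0.0
    for g in np.unique(groups):
        m = groups == g
        observed = x[m].mean()
        expected = rasch_prob(thetas[m].mean(), b)
        sq += m.sum() * (observed - expected) ** 2
    return np.sqrt(sq / len(thetas))

def country_difficulty(x, thetas, b_init=0.0, n_iter=50, tol=1e-8):
    """Newton-Raphson ML estimate of a country-specific Rasch difficulty,
    treating examinee abilities as fixed point estimates from the
    international calibration (a simplifying assumption)."""
    b = b_init
    for _ in range(n_iter):
        p = rasch_prob(thetas, b)
        step = np.sum(p - x) / np.sum(p * (1.0 - p))
        b += step
        if abs(step) < tol:
            break
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_examinees, b_intl = 2000, np.array([-1.0, 0.0, 1.0])
    thetas = rng.normal(size=n_examinees)

    # Simulate one country in which item 2 is a full logit harder than
    # its international difficulty implies (i.e., the item misfits).
    b_true = b_intl.copy()
    b_true[2] += 1.0
    x = (rng.random((n_examinees, 3))
         < rasch_prob(thetas[:, None], b_true)).astype(float)

    for j in range(len(b_intl)):
        rmsd = item_rmsd(x[:, j], thetas, b_intl[j])
        if rmsd > 0.10:  # illustrative flagging threshold, not the article's
            b_j = country_difficulty(x[:, j], thetas, b_init=b_intl[j])
            print(f"item {j}: RMSD = {rmsd:.3f} -> country-specific b = {b_j:.2f}")
        else:
            print(f"item {j}: RMSD = {rmsd:.3f} -> keep international b = {b_intl[j]:.2f}")
```

In an operational calibration, one would instead estimate all parameters jointly by marginal maximum likelihood over the full latent ability distribution rather than conditioning on fixed ability estimates, but the flag-then-free structure above mirrors the partial-invariance logic the article describes.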
