Use of Response Process Data to Inform Group Comparisons and Fairness Research

ABSTRACT Comparing groups is one of the key uses of large-scale assessment results, which are used to gain insights that inform policy and practice and to examine the comparability of scores and score meaning. Such comparisons typically focus on examinees’ final answers or responses to test questions, ignoring differences in the response processes that groups may engage in. This paper discusses and demonstrates the use of response process data to enhance group comparison and fairness research methodologies. We propose two statistical approaches for identifying differential response processes that extend differential item functioning (DIF) detection methods, and we demonstrate the complementary use of process data in comparing groups in two case studies. Our findings demonstrate the use of response process data in gaining insights about the test-taking behaviors of students from different populations that go beyond what may be identified using response data alone.
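The abstract does not specify the proposed procedures, so the following is only a minimal illustrative sketch of how a standard DIF technique (the Mantel-Haenszel procedure) could be applied to a binary response-process indicator rather than to item scores. The data frame columns, the rapid-guessing flag, and the score-band stratifier are assumptions for illustration, not the authors’ actual method.

```python
# Illustrative sketch: Mantel-Haenszel analysis of a binary process indicator
# (e.g., a rapid-guessing flag derived from response time), with examinees
# matched on banded total score. Column names are hypothetical.
import numpy as np
import pandas as pd


def mh_process_dif(df: pd.DataFrame,
                   group_col: str = "group",
                   reference: str = "reference",
                   focal: str = "focal",
                   indicator_col: str = "rapid_guess",
                   stratum_col: str = "score_band"):
    """Common odds ratio for a 0/1 process indicator across score strata,
    plus the ETS delta-scale transformation (-2.35 * ln(alpha_MH))."""
    num, den = 0.0, 0.0
    for _, s in df.groupby(stratum_col):
        a = ((s[group_col] == reference) & (s[indicator_col] == 1)).sum()  # reference, flagged
        b = ((s[group_col] == reference) & (s[indicator_col] == 0)).sum()  # reference, not flagged
        c = ((s[group_col] == focal) & (s[indicator_col] == 1)).sum()      # focal, flagged
        d = ((s[group_col] == focal) & (s[indicator_col] == 0)).sum()      # focal, not flagged
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    if den == 0:
        return np.nan, np.nan
    alpha_mh = num / den                  # Mantel-Haenszel common odds ratio
    mh_d_dif = -2.35 * np.log(alpha_mh)   # ETS delta-scale metric
    return alpha_mh, mh_d_dif
```

Under these assumptions, the resulting delta-scale statistic could be screened with the same kinds of flagging conventions used in conventional DIF practice, with the difference that a flag here would signal a group difference in the process indicator rather than in item performance.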
