Classification accuracy in Key Stage 2 National Curriculum tests in England

The accuracy of the results of the national tests in English, mathematics and science taken by 11-year-olds in England has been a matter of much debate since their introduction in 1994, with estimates of the proportion of students incorrectly classified varying from 10% to 30%. Using live data from the 2009 and 2010 administrations of the national tests, this paper applies a number of models, drawing on both classical and modern test theories, to explore the relationship between test reliability and the extent of misclassification when a student’s test score is reported as one of a small number of discrete levels of achievement. The results indicate that, averaged across the two cohorts (2009 and 2010) and six models, the classification accuracy of the tests was about 85% in English, 90% in mathematics and 87% in science. Moreover, the different models yielded very similar results: the standard deviations of the classification accuracy values were 1.9% for English, 1.0% for mathematics and 1.3% for science.
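
The link between reliability and misclassification can be illustrated with a minimal sketch. It is not one of the six models used in the paper. It assumes a simple classical-test-theory setting in which true scores and measurement errors are both normally distributed, so that the standard error of measurement is the score standard deviation multiplied by the square root of one minus the reliability. The function name `classification_accuracy`, its parameters and the cut scores in the usage note are hypothetical and chosen only for illustration.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def classification_accuracy(reliability, cuts, score_mean=0.0, score_sd=1.0,
                            n_grid=2001):
    """Expected proportion of students whose observed level equals their
    true level.

    Illustrative assumptions (not taken from the paper):
      * true scores are normal with variance reliability * score_sd**2
      * errors are normal with variance (1 - reliability) * score_sd**2
      * levels are defined by the cut scores in `cuts`
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")

    true_sd = score_sd * math.sqrt(reliability)
    sem = score_sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    bounds = [-math.inf] + sorted(cuts) + [math.inf]

    # Integrate numerically over a grid of true scores (+/- 6 true-score SDs).
    low = score_mean - 6.0 * true_sd
    step = 12.0 * true_sd / (n_grid - 1)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n_grid):
        true_score = low + i * step
        weight = math.exp(-0.5 * ((true_score - score_mean) / true_sd) ** 2)
        # Find the level containing this true score, then the probability
        # that the observed score falls in the same level.
        for lower, upper in zip(bounds[:-1], bounds[1:]):
            if lower <= true_score < upper:
                p_same = (normal_cdf(upper, true_score, sem)
                          - normal_cdf(lower, true_score, sem))
                break
        weighted_sum += weight * p_same
        weight_total += weight
    return weighted_sum / weight_total
```

For example, `classification_accuracy(0.9, cuts=[-1.0, 0.0, 1.0])` gives the expected accuracy for a hypothetical test with reliability 0.9 and three cut scores on a standardised scale. Under these assumptions, accuracy rises as reliability increases and falls as more cut scores are added.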
