Psychometric Characteristics of Assessment Procedures

In this chapter by Wasserman and Bracken, the most salient psychometric characteristics of psychological tests are described, incorporating elements from both classical test theory and item response theory. Guidelines are provided for the evaluation of test technical adequacy across a wide array of psychological tests in the areas of sampling, norming, scaling, validity, reliability, and fairness. Sampling, norming, and scaling guidelines address appropriate sample sizes, the accuracy and recency of norms, item and scale gradients, and floor and ceiling effects. Evidence of test score validity is described as occurring in two broad classes (internal and external), both of which ultimately are concerned with construct validity. Evidence of test score reliability includes the extent to which measurement results are precise and accurate; free from random and unexplained error; and consistent, accurate, and uniform across occasions, time, observers, and samples. Fairness is described as an important additional area of psychometric emphasis. A holistic and systemic approach to test score fairness is described, extending from test conception through applied consequences. Finally, the study of psychometrics is criticized for its historical over-reliance on internal sources of evidence, and recommendations are made for an increased focus on external, applied sources of evidence of psychometric adequacy.

Keywords: fairness; norms; psychometric; reliability; sampling; test; validity
