INVESTIGATING THE UTILITY OF ANALYTIC SCORING FOR THE TOEFL ACADEMIC SPEAKING TEST (TAST)

This study explores the utility of analytic scoring for the TOEFL® Academic Speaking Test (TAST) in providing useful and reliable diagnostic information on three aspects of candidates' performance: delivery, language use, and topic development. Generalizability (G) studies were used to investigate the dependability of the analytic scores, the distinctness of the analytic dimensions, and the variability of analytic score profiles. Raters' perceptions of dimension separability were also obtained. Judged by the phi coefficients and standard errors of measurement (SEMs), the dependability of analytic scores averaged across six tasks and double ratings was acceptable for both operational and practice settings. However, scores averaged across two tasks and double ratings were not reliable enough for operational use. Correlations among the analytic scores by task were high, though those between delivery and topic development were lower; raters' perceptions corroborated these results. When scores were averaged across tasks or task types (two or more tasks), correlations among the analytic scores were very high and the score profiles were flat. The utility of analytic scoring is discussed in terms of both score dependability and whether analytic scores provide diagnostic information beyond that provided by holistic scores.
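
As a rough illustration of how a phi coefficient and an absolute-error SEM depend on the number of tasks and ratings, the sketch below uses the standard generalizability-theory formulas for a fully crossed persons x tasks x ratings random-effects design. The abstract does not state the study's actual design or its variance components, so the crossed design, the function names, and every number in `hypothetical_components` are assumptions made up for illustration only; they are not results from the study.

```python
import math


def absolute_error_variance(vc, n_tasks, n_ratings):
    """Absolute error variance for a fully crossed p x t x r random design.

    `vc` maps effect names to variance components. The design and the
    component names are assumptions for this sketch, not taken from the study.
    """
    return (
        vc["t"] / n_tasks
        + vc["r"] / n_ratings
        + vc["pt"] / n_tasks
        + vc["pr"] / n_ratings
        + vc["tr"] / (n_tasks * n_ratings)
        + vc["ptr,e"] / (n_tasks * n_ratings)
    )


def phi_coefficient(vc, n_tasks, n_ratings):
    """Index of dependability: person variance over person plus absolute error variance."""
    error = absolute_error_variance(vc, n_tasks, n_ratings)
    return vc["p"] / (vc["p"] + error)


def absolute_sem(vc, n_tasks, n_ratings):
    """Standard error of measurement for absolute decisions."""
    return math.sqrt(absolute_error_variance(vc, n_tasks, n_ratings))


# Hypothetical variance components (invented values, NOT from the study).
hypothetical_components = {
    "p": 0.50,      # persons (universe-score variance)
    "t": 0.02,      # tasks
    "r": 0.01,      # ratings
    "pt": 0.15,     # person-by-task interaction
    "pr": 0.02,     # person-by-rating interaction
    "tr": 0.00,     # task-by-rating interaction
    "ptr,e": 0.20,  # residual
}

if __name__ == "__main__":
    # Compare a two-task and a six-task score, each with double ratings.
    for n_tasks in (2, 6):
        phi = phi_coefficient(hypothetical_components, n_tasks, 2)
        sem = absolute_sem(hypothetical_components, n_tasks, 2)
        print(f"tasks={n_tasks}, ratings=2: phi={phi:.3f}, SEM={sem:.3f}")
```

With these invented components, adding tasks shrinks the person-by-task and residual contributions to error, which is the general mechanism by which a six-task average can be more dependable than a two-task average.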
