Classroom observation systems in context: A case for the validation of observation systems

Researchers and practitioners sometimes presume that using a previously "validated" instrument will produce "valid" scores; contemporary views of validity, however, suggest many reasons this assumption can fail. To demonstrate some of the problems with this view, and to support comparisons of observation protocols across contexts, we introduce and define the conceptual tool of an observation system. We then describe psychometric evidence for a popular teacher observation instrument, Charlotte Danielson's Framework for Teaching, in three use contexts: a lower-stakes research context, a lower-stakes practice-based context, and a higher-stakes practice-based context. Despite sharing a common instrument, the three observation systems and their associated use contexts combine to produce different average teacher scores, different score distributions, and different levels of score precision. All three systems nevertheless produce higher average scores in the classroom environment domain than in the instructional domain, and all three sets of scores support a one-factor model, whereas the Framework posits four factors. We discuss how the dependencies between aspects of observation systems, together with practical constraints, leave researchers with significant validation challenges and opportunities.
