Good reasons for high variability (low inter-rater reliability) in performance assessment: Toward a fuzzy logic model

Abstract

Regular performance assessment is an integral part of (high-)risk industries. Past research shows, however, that in many fields inter-rater reliabilities tend to be moderate to low. This study was designed to investigate the variability of performance assessment in a naturalistic setting in aviation. A modified think-aloud protocol was used as the research design to investigate the reasoning that pairs of pilots use to assess the performance of an airline captain in a high-risk situation. Standard protocol analysis and interaction analysis methods were applied to the transcribed verbal protocols. The analyses confirm high variability in performance assessment and reveal the good, albeit fuzzy, justifications that assessor pairs use to ground their assessments. A fuzzy logic model yields predicted ratings that approximate the actual ratings well. Implications for the practice of performance assessment are provided.

Relevance to industry

Many industries aim to achieve consistency in identifying true performance levels. However, if the variability in performance assessment is a real phenomenon, as reported here, then practitioners and researchers might have to test whether it can be used positively, e.g., as an opportunity for improving the resilience of crews.
