Improving Scoring Consistency of Flight Performance through Inter-Rater Reliability Analyses

Students and other stakeholders of flight schools need confidence that flight performance scores are a meaningful indicator of the student’s performance rather than an arbitrary reflection of the instructor’s perception. Scores should therefore be reasonably consistent from one instructor to another. Apparent inconsistency in scoring between instructors can be examined through inter-rater reliability (IRR) analyses. Inter-rater reliability measures the extent of agreement between two or more raters; it is used to assess the consistency of a scoring or rating system and of those who use it. This foundational investigation was designed to assess inter-rater reliability among instructor pilots observing 10 sample flights performed by student pilots. The results indicated that inter-rater reliability was low. Suggestions for improving the consistency of flight instructor scoring are discussed, along with recommendations for future research.
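To make the notion of agreement between raters concrete, the following minimal sketch computes Cohen's kappa for two instructors scoring the same 10 flights. The abstract does not state which IRR statistic the study used, so the choice of kappa is an assumption made for illustration, and the function name `cohen_kappa`, the score labels, and the data are all hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal scores to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: computed from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[score] * counts_b[score] for score in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical score for every item.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical scores (S = satisfactory, U = unsatisfactory) for 10 sample flights.
instructor_a = ["S", "S", "U", "S", "U", "S", "S", "U", "S", "U"]
instructor_b = ["S", "U", "U", "S", "S", "S", "U", "U", "S", "S"]

kappa = cohen_kappa(instructor_a, instructor_b)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.167"
```

In this invented example the instructors agree on 6 of 10 flights, but because 52% agreement would be expected by chance given how often each uses each score, kappa is only about 0.17. This illustrates how raw percent agreement can overstate consistency. With more than two raters, a multi-rater statistic would be needed instead.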
