Can you believe my eyes? The importance of interobserver reliability statistics in observations of animal behaviour

Interobserver (or inter-rater) reliability is a vital part of all psychological studies that use an observational methodology to address questions of human behaviour. Concerns about reliability in these studies have long since left the arena of ‘should we use an interobserver reliability statistic?’ for debate on the particular type of statistic to be used, and academic careers have been built on this question. In stark contrast, however, it appears to be extremely rare to see interobserver reliability addressed at all in observational studies of animal behaviour. While we would never claim that this omission would or should be a basis for deeming a paper unacceptable, or disregarding its conclusions, we do feel that observational procedures are an integral part of the methodology of many studies, and that their inclusion in published papers should be commonplace.

As an informal measure of the frequency with which interobserver reliability was addressed in papers involving the observation of animal behaviour, we surveyed articles recently published in the journal Animal Behaviour. Our data came from volume 75 (3, 4) and volume 76 (1) of the journal, which were at the time the most recent issues available. We examined the first 100 articles (alphabetical by first author) that were methodologically relevant. Articles included in the survey used observational methodologies such as classification of behaviours, judgment of occurrence (or nonoccurrence) of behaviours, and the counting of instances of behaviour. Articles deemed not methodologically relevant and thus excluded from the analysis included studies using computer modelling techniques, studies in which results were strictly nominal (i.e. presence or absence of a physical object or the number of objects present), and studies that dealt with measurable quantitative variables such as weight, length or hormone levels.
Studies such as these, of course, are also subject to error on the part of a single experimenter and are always improved by multiple, reliable experimenters; however, the problem is less pressing than it is in studies that deal with strictly behavioural observations.

Ninety-six of the 100 surveyed articles did not address interobserver reliability in their published text. Of these 96 articles, three mentioned some form of replication of the observations, seven

* Correspondence: A. B. Kaufman, Department of Neuroscience, LSP 2915, University of California, Riverside, Riverside, CA 92521, U.S.A. E-mail address: allison.kaufman@email.ucr.edu (A. B. Kaufman).
1 R. Rosenthal is at the Department of Psychology, 3111B Psychology Building, University of California, Riverside, Riverside, CA 92521, U.S.A. E-mail address: robert.rosenthal@ucr.edu
