Circularity effects in corpus studies – why annotations sometimes go round in circles

Abstract: Linguistic corpus research mainly deals with annotated data rather than raw data. This contribution investigates the status of annotated corpus data in empirical linguistics. We argue that annotators should be regarded as co-producers of data: annotations depend on theoretical categories and are therefore theory-laden. Annotation categories differ in their level of description (structural or functional) and in their degree of canonisation. For example, annotating a corpus item as a noun at the structural level is a highly canonised decision in most cases, whereas assigning a cognitive-functional category such as "expression with identifiable referent" depends on specific theories that often lack established definitions. As a minimal requirement, annotated data must allow the original raw data to be reconstructed, and annotations should be constrained by guidelines so that the annotator’s decisions are not arbitrary. Annotation problems resulting from the close relation between annotation categories and their theoretical prerequisites are exemplified by a newspaper corpus study and a study of a second-language acquisition corpus, both dealing with anaphora as a discourse-functional phenomenon. We show that the problems discussed originate in two circles. The first results from the interplay of deductive and inductive procedures, through which theory influences annotation. The second originates from the relation between language structures and their discourse functions: the latter cannot be observed independently of the structural features of the utterance.