Annotating emotion in dialogue – Issues and Approaches

Adopting statistical tests from content analysis allows dialogue analysts to evaluate their work and make comparisons with other research in the field. However, since some aspects of communicative behaviour are more difficult to study than others, such broad comparisons may lead us to avoid interesting areas of research. Emotion in dialogue is one such area.

1 Dialogue analysis and annotation

There is a wealth of applications for which we might want a computer to be able to understand human dialogue. These include natural language interfaces to information sources or services (Gorin et al., 1997) and the development of communicative agents for educational, entertainment or commercial purposes (Piwek, 2003). Even applications which are not concerned solely with dialogue, such as speech-to-speech translation, can benefit from an understanding of how humans behave during a dialogue (Reithinger and Maier, 1995). To facilitate such applications, the role of dialogue analysis is to investigate, formalise and document communicative behaviour, allowing us to study it empirically, statistically and computationally.

One of the tools available to a dialogue analyst is dialogue annotation: the process of labelling segments of dialogue with labels that describe some property of that segment. For example, a dialogue may be split up into utterances, with each utterance labelled for the topic to which it refers, e.g.:

A. How was your holiday? [Holiday]
B. I had to cancel it because I was ill. [Holiday]
A. Illness is a real drag. [Illness]

There is a consensus that the richness and complexity of dialogue can best be studied using a multi-layered approach, with each layer labelling a separate property of the dialogue's content. Using this approach, a single dialogue corpus can be used to investigate a number of different phenomena, and relationships between the different layers can also be studied.

2 Emotion in dialogue

As our understanding of dialogue and communicative behaviour increases, and as the applications we develop from that understanding become more sophisticated, it becomes important to investigate more subtle and possibly more interesting properties of dialogue. One such area of investigation is emotion. There are a number of applications which would benefit from an understanding of how emotion influences the way we communicate. For example, those developing communicative agents, either to talk to each other or to humans, are interested in making the agents more believable and the speech that they generate more natural (Ball and Breese, 1998; André et al., 1998; Piwek, 2003). Also, for automated call centres it can be important to identify when a caller is becoming agitated or frustrated with the system. Clearly, an understanding of the relationship between the emotion of a speaker and the speech that they produce would help achieve these aims, and others.

One way in which this understanding of emotion could be gained is through the development of a dialogue corpus annotated for the emotion expressed by the speakers. In order to do this, an appropriate annotation scheme must be developed. How this aim may be achieved is discussed in section 5.

3 Annotating subtle, rare, or subjective phenomena

To date, dialogue analysis has almost exclusively been concerned with the identification of objective and quantifiable aspects of its content. The reasons for this are best conveyed through an understanding of how dialogue annotation is planned and executed.
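To ground the discussion that follows, the sketch below gives a concrete picture of the multi-layered annotation described in section 1, extended with the kind of emotion layer proposed in section 2. It is a minimal illustration only: the data structure, the layer names and the emotion labels are assumptions made for this sketch rather than part of any existing annotation scheme or toolkit.

# A minimal sketch (Python) of a multi-layered dialogue annotation.
# Each utterance carries one label per layer; the layers ("topic",
# "emotion") and the labels themselves are purely illustrative.

dialogue = [
    {"speaker": "A", "text": "How was your holiday?",
     "layers": {"topic": "Holiday", "emotion": "neutral"}},
    {"speaker": "B", "text": "I had to cancel it because I was ill.",
     "layers": {"topic": "Holiday", "emotion": "sadness"}},
    {"speaker": "A", "text": "Illness is a real drag.",
     "layers": {"topic": "Illness", "emotion": "sympathy"}},
]

def layer(dialogue, name):
    """Return the sequence of labels making up a single annotation layer."""
    return [utterance["layers"][name] for utterance in dialogue]

print(layer(dialogue, "topic"))    # ['Holiday', 'Holiday', 'Illness']
print(layer(dialogue, "emotion"))  # ['neutral', 'sadness', 'sympathy']

Keeping the layers separate in this way is what allows a single corpus to support studies of individual phenomena, and of the relationships between layers, as noted above.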
3.1 Developing annotation schemes

To annotate a dialogue corpus, labelling is usually performed by human annotators following an annotation scheme. These schemes describe the labels which can be applied and the circumstances in which to apply them. Examples of dialogue annotation (coding) schemes include one for coding children's speech in the CHILDES project (MacWhinney, 1998) and many schemes for coding dialogue acts (Core and Allen, 1997; Di Eugenio et al., 1998; Jurafsky et al., 1997).

The design of these schemes is frequently motivated by some theory about dialogue or language in general. For example, Searle's theory of speech acts (Searle, 1969), which describes how actions are performed through speech, has led to the idea of dialogue acts, which are used to describe the function of an utterance. In turn, the choice of which labels to include in a scheme is often motivated by a particular domain or problem. For instance, the above-mentioned CHILDES scheme uses labels suited to describing the utterances produced by children. Similarly, a scheme used to identify positive and negative behaviour of a health professional attempting to elicit the concerns of a cancer patient (Heaven and Maguire, 1997) uses labels which highlight such behaviour.

The next important step in developing an annotation scheme is to prove that it is appropriate for the task for which it is developed. For a scheme to be valid it is important that it produces reliable results. Klaus Krippendorff proposed three measures of reliability: stability, reproducibility and accuracy (Krippendorff, 1980). For our purposes, stability means that the same scheme applied more than once to the same data, at different points in time, will give the same results; reproducibility means that more than one annotator applying the scheme should yield similar results; and accuracy means that these results should match some 'correct' standard. Usually, validity for dialogue annotation schemes is assessed using inter-rater reliability, which is a reproducibility test. We shall return to the troublesome subject of inter-rater reliability later, but suffice it to say that for restricted domains and clearly defined labels, satisfactory levels of reliability can often be obtained.

3.2 Issues regarding subtler phenomena

Alongside these types of annotation, a multi-layered annotation may be augmented by layers describing phenomena which are more subtle or subjective, such as emotion or speaker intention. Consider the above process being applied in order to develop schemes for layers such as these. Since emotion and intention are less closely associated with linguistics than, say, dialogue acts, choosing labels that apply to individual portions of a dialogue is consequently a more difficult task. Furthermore, describing these types of phenomena is harder than describing, for example, the topic of conversation. For these reasons, creating a suitable list of labels from which to construct an annotation scheme for subtle, subjective or even non-linguistic phenomena can pose a serious obstacle on the path to annotating them in dialogue.

If, after some persistence, a list of labels is constructed and an annotation manual produced, the real difficulty of proving the validity of the scheme is upon us. Results of stability and reliability tests are heavily influenced by the ease with which an annotator is able to identify the circumstances under which each label is applicable.
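In practice, the reproducibility test described in section 3.1 usually amounts to having two or more annotators label the same segments independently and then measuring how often they agree. The sketch below shows the simplest such check, raw percentage agreement, for two hypothetical annotators; the annotators, label set and codings are all invented for illustration.

# A minimal reproducibility check (Python): raw percentage agreement
# between two annotators who labelled the same utterances independently.
# Both the label set and the two codings are invented for illustration.

annotator_1 = ["Complete", "Complete", "Abandoned", "Complete"]
annotator_2 = ["Complete", "Complete", "Complete",  "Complete"]

def percentage_agreement(coding_a, coding_b):
    """Proportion of segments to which two codings assign the same label."""
    assert len(coding_a) == len(coding_b), "codings must cover the same segments"
    matches = sum(a == b for a, b in zip(coding_a, coding_b))
    return matches / len(coding_a)

print(percentage_agreement(annotator_1, annotator_2))  # 0.75

A score like this says little on its own, partly because it takes no account of chance agreement (a point section 4 returns to) and partly because, as noted above, it depends on how easily annotators can decide when each label applies.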
For example, when labelling utterances according to whether they were successfully completed, as one might do when applying the DAMSL scheme (Core and Allen, 1997), identifying which utterances were completed and which were not is a fairly trivial task, e.g.:

A. Have you seen the new secretary? [Complete]
B. Yeah, what's her name? [Complete]
A. It's errr.... [Abandoned]
B. Oh never mind. [Complete]

The same cannot be said for labelling the intention of a speaker, since this facet of communicative behaviour is much less evident from the content of someone's speech. Harder still is establishing whether any given annotation of a dialogue is 'correct', as one would be required to do to pass the third of Krippendorff's tests for reliability, accuracy. Increasingly there is a reliance on inter-rater reliability tests to measure the quality of dialogue annotation research. Of course, establishing the validity of any scheme, and striving to develop schemes which are as reliable as possible, is an important part of dialogue analysis. However, insisting on a minimum 'score' from a particular reliability test before a scheme may be accepted by the field would make the honourable aim of researching difficult (and therefore interesting) aspects of dialogue very hard to pursue.

4 Inter-rater reliability

In 1996, Jean Carletta recognised that, in order to improve the consistency of computational linguistics research and to facilitate comparisons between different researchers' results, a more rigorous, common approach to evaluation was required (Carletta, 1996). This certainly applies to dialogue annotation schemes.

4.1 Selecting an appropriate reliability test

In that paper it is suggested that the Kappa statistic (Siegel, 1988) is a suitable measure of agreement with which to validate annotation. The Kappa statistic measures the level of agreement between any number of annotators assigning nominal labels to objects, awarding a score of 1 for perfect agreement and 0 for the level of agreement that could be expected from random behaviour by the annotators (scores below 0 indicate agreement worse than chance). However, the suitability of Kappa for this purpose has been questioned because of issues regarding the accuracy with which it measures the level of agreement expected by chance (Krippendorff, 1980).

An alternative agreement measure that was also mentioned in (Carletta, 1996) is Krippendorff's Alpha statistic (Krippendorff, 1980). Alpha has similar properties to Kappa but takes the frequency with which labels are used into account when calculating the level of agreement that could be expected from chance coding, which was the source of Kappa's inaccuracy. Considering this, in most cases it would appear sensible to use Alpha in place of Kappa.

The researcher interested in annotating the type of subtle phenomena discussed in section 3 may be called upon to use more sophisticated labelling approaches than simple categories. For instance, we could use a numerical scale to label the level of confidence that a speaker has in an assertion, which could have app
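The Kappa statistic discussed in section 4.1 takes the form (Po - Pe) / (1 - Pe), where Po is the observed agreement and Pe the agreement expected by chance; the disagreement between measures such as Kappa and Alpha is largely about how Pe should be estimated. The sketch below illustrates this for two annotators, computing the chance term both uniformly over the label set and from pooled label frequencies. The codings are invented, and the sketch is intended only to show the effect of the chance model, not as an implementation of any published statistic.

# Chance-corrected agreement (Python) for two annotators, of the form
#     (observed_agreement - chance_agreement) / (1 - chance_agreement).
# The two chance models below are illustrative only: they show how the
# choice of chance model affects the score, and do not reproduce the
# exact definitions of Kappa or Alpha.
from collections import Counter

annotator_1 = ["Complete", "Complete", "Abandoned", "Complete", "Complete", "Complete"]
annotator_2 = ["Complete", "Complete", "Complete",  "Complete", "Abandoned", "Complete"]

def observed_agreement(a, b):
    """Proportion of segments on which the two codings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def chance_uniform(a, b):
    """Chance agreement if every label used were equally likely."""
    return 1.0 / len(set(a) | set(b))

def chance_from_frequencies(a, b):
    """Chance agreement estimated from the pooled label frequencies."""
    counts = Counter(a) + Counter(b)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

def chance_corrected_agreement(a, b, chance_model):
    p_o, p_e = observed_agreement(a, b), chance_model(a, b)
    return (p_o - p_e) / (1 - p_e)

print(chance_corrected_agreement(annotator_1, annotator_2, chance_uniform))
print(chance_corrected_agreement(annotator_1, annotator_2, chance_from_frequencies))

With the skewed label distribution above, the two chance models give markedly different scores (roughly 0.33 and -0.2 respectively): raw agreement of 4/6 looks respectable, but once chance agreement is estimated from the label frequencies the coding turns out to be worse than chance. Discrepancies of this kind are what motivate preferring frequency-aware measures such as Alpha, particularly for the rare or skewed labels that subtle phenomena tend to produce.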