Using word-level pitch features to better predict student emotions during spoken tutoring dialogues

Abstract

In this paper, we advocate the use of word-level pitch features for detecting user emotional states during spoken tutoring dialogues. Prior research has primarily focused on the use of turn-level features as predictors. We compute pitch features at the word level and resolve the problem of combining multiple features per turn using a word-level emotion model. Even under a very simple word-level emotion model, our results show an improvement in prediction when using word-level features over turn-level features. We find that the advantage of word-level features lies in a better prediction of longer turns.

1. Introduction

We investigate the utility of pitch features applied at the word level for the task of predicting student emotions in two corpora of spoken tutoring dialogues. Motivation for this work comes from the performance gap between human tutors and current machine tutors: typically, students tutored by human tutors learn more than students tutored by computer tutors. One of the methods currently being explored as a way of closing this gap is to incorporate affective reasoning into current computer tutoring systems, including dialogue-based tutoring systems, e.g. [1, 2].

Previous spoken dialogue research in other domains has shown that turn-level prosodic, lexical, dialogue, and other features can be used to predict user emotional states [3-5]. To better approximate the prosodic information, [6] uses word-level features and successfully applies them to a different emotion detection task. To our knowledge, there is no previous work that directly compares the impact of using features at the sub-turn rather than the turn level for emotion prediction. In this paper we perform a first comparison of the two levels for the task of detecting student emotional states.

There are many choices for sub-turn units (breath groups, intonational phrases, syntactic chunks, words, syllables). We will use words as our sub-turn units because the segmentation is straightforward and because these units have been used successfully by other researchers for similar tasks [6]. Moreover, in a real-time dialogue system, the segmentation is available as a byproduct of automatic speech recognition. To simplify our word versus turn-level feature comparison, we will focus only on pitch features.
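
To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how word-level pitch features might be computed from ASR word alignments and combined into a turn-level prediction. The specific feature set (per-word F0 mean, max, min, range), the fixed 10 ms frame step, and the majority-vote combination are illustrative assumptions; `word_classifier` stands in for any trained per-word emotion classifier.

```python
from dataclasses import dataclass
from statistics import mean
from collections import Counter


@dataclass
class Word:
    text: str
    start: float  # start time in seconds, from the ASR word alignment
    end: float    # end time in seconds


def word_pitch_features(word, f0_track, frame_step=0.01):
    """Compute simple pitch statistics for one word.

    f0_track: list of F0 values in Hz, one per 10 ms frame; unvoiced frames are 0.
    Returns per-word F0 mean, max, min, and range over voiced frames.
    """
    lo = int(word.start / frame_step)
    hi = int(word.end / frame_step)
    voiced = [f0 for f0 in f0_track[lo:hi] if f0 > 0]
    if not voiced:
        return {"f0_mean": 0.0, "f0_max": 0.0, "f0_min": 0.0, "f0_range": 0.0}
    return {
        "f0_mean": mean(voiced),
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "f0_range": max(voiced) - min(voiced),
    }


def predict_turn_emotion(words, f0_track, word_classifier):
    """A very simple word-level emotion model: classify each word from its
    pitch features, then label the turn by majority vote over its words."""
    word_labels = [
        word_classifier(word_pitch_features(w, f0_track)) for w in words
    ]
    return Counter(word_labels).most_common(1)[0][0]
```

In practice the per-word predictions could be combined in other ways (for example, weighting words by duration or classifier confidence); the majority vote here simply mirrors the kind of very simple word-level emotion model described above.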

References

[1] S. S. Narayanan et al., "Combining acoustic and language information for emotion recognition," in Proc. INTERSPEECH, 2002.
[2] I. H. Witten et al., Data Mining, 2000.
[3] D. J. Litman et al., "Predicting Student Emotions in Computer-Human Tutoring Dialogues," in Proc. ACL, 2004.
[4] P. Taylor, "Analysis and synthesis of intonation using the Tilt model," The Journal of the Acoustical Society of America, 2000.
[5] E. Nöth et al., "How to find trouble in communication," Speech Communication, 2003.
[6] J. Mostow et al., "Experimentally augmenting an intelligent tutoring system with human-supplied capabilities: adding human-provided emotional scaffolding to an automated reading tutor that listens," in Proc. Fourth IEEE International Conference on Multimodal Interfaces, 2002.
[7] D. J. Litman et al., "Exceptionality and Natural Language Learning," in Proc. CoNLL, 2003.
[8] L. P. Heck et al., "Modeling dynamic prosodic variation for speaker verification," in Proc. ICSLP, 1998.
[9] A. Stolcke et al., "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in Proc. INTERSPEECH, 2002.
[10] L. Lamel et al., "Emotion detection in task-oriented spoken dialogues," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2003.
[11] I. H. Witten et al., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2002.
[12] J. Hirschberg et al., "Identifying User Corrections Automatically in Spoken Dialogue Systems," in Proc. NAACL, 2001.