Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction

Recent technological advances have enabled human users to interact with computers in ways previously unimaginable. Beyond the confines of the keyboard and mouse, new modalities for controlling the computer, such as voice, gesture, and force feedback, are emerging. Among these, voice and vision are two natural modalities in human-to-human communication. Automatic speech recognition (ASR) technology has matured enough to allow users to dictate to a word processor or operate the computer using voice commands. Computer vision techniques have enabled the computer to see. Interacting with computers through these modalities is much more natural for people, and the trend is toward the kind of interaction that takes place between humans. Despite these advances, one necessary ingredient for natural interaction is still missing: emotions. Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for the computer in applications such as computer-aided learning or user-friendly online help. This thesis addresses the problem of detecting human emotional expressions by computer from the voice and facial motions of the user. The computer is equipped with a microphone to listen to the user's voice and a video camera to look at the user. Prosodic features in the audio and motions exhibited on the face can help the computer make inferences about the user's emotional state, assuming that users are willing to show their emotions. Another problem the thesis addresses is the coupling between voice and facial expression. Sometimes the user moves the lips to produce speech, and sometimes the user only exhibits a facial expression without speaking any words. Therefore, it is important to handle these two modalities accordingly.
In particular, a pure “facial expression detector” will not function properly when the user is speaking, and a pure “vocal emotion recognizer” is useless when the user is not speaking. In this thesis, a complementary relationship between audio and video is proposed. Although these two modalities do not couple strongly in time, they seem to complement each other. In some cases, similar facial expressions may have different vocal characteristics, and vocal emotions with similar properties may have distinct facial behaviors.