Towards a standard set of acoustic features for the processing of emotion in speech

Researchers concerned with the automatic recognition of human emotion in speech have proposed a considerable variety of segmental and supra-segmental acoustic descriptors. These range from prosodic characteristics and voice quality to acoustic correlates of articulation, and they reflect differing degrees of perceptual elaboration. Recently, first comparisons across multiple speech databases have provided evidence that spectral and cepstral characteristics might have the greatest potential for the task. Yet novel acoustic correlates are constantly being proposed, as the question of the optimal representation remains disputed. Evaluating the suggested correlates is non-trivial, because no agreed "standard" feature set or method of assessment exists, and inter-corpus substantiation is usually lacking. Such substantiation is particularly difficult owing to the divergence of the models employed for the ground-truth description of emotion. To ease this challenge, using the arousal-valence space as the predominant means for...