Paper information - Pitch-based emphasis detection for characterization of meeting recordings

The automatic extraction of key utterances in spoken data has emerged as an interesting and difficult topic in automatic speech recognition. "Emphasis" or "excitement" may be a useful identifier for these utterances of interest. We undertake the task of reliably and automatically identifying emphasized or excited utterances in natural speech in a meeting setting. We start by endeavoring to establish reliable ground truth emphasis labels by using several hand-labelers. The results show that human listeners can reliably identify emphasized utterances in meeting recordings. We then build an automatic emphasis detection system, which uses normalized pitch as its only acoustic predictor. The results show that this pitch-based emphasis detection scheme can distinguish between non-emphasized and emphasized utterances with an accuracy of 92% when ambiguous cases are excluded, a rate comparable to human interlabeler agreement.
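As a rough illustration of what a detector driven only by normalized pitch might look like, the sketch below z-scores log-F0 per speaker and flags an utterance when enough of its voiced frames sit well above that speaker's typical pitch. This is not the authors' implementation: the function names, the z-score normalization, the convention that unvoiced frames are marked with 0, and both thresholds are assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev


def normalize_pitch(f0_by_utterance):
    """Z-score log-F0 over all voiced frames of ONE speaker.

    f0_by_utterance: list of utterances, each a list of F0 values in Hz.
    Frames with F0 <= 0 are treated as unvoiced and dropped (assumption).
    """
    voiced = [math.log(f) for utt in f0_by_utterance for f in utt if f > 0]
    mu = mean(voiced)
    sigma = pstdev(voiced) or 1.0  # avoid division by zero on flat pitch
    return [[(math.log(f) - mu) / sigma for f in utt if f > 0]
            for utt in f0_by_utterance]


def emphasis_scores(f0_by_utterance, frame_threshold=1.0):
    """Fraction of voiced frames whose normalized pitch exceeds frame_threshold."""
    normalized = normalize_pitch(f0_by_utterance)
    return [sum(z > frame_threshold for z in utt) / len(utt) if utt else 0.0
            for utt in normalized]


def detect_emphasis(f0_by_utterance, frame_threshold=1.0, score_threshold=0.3):
    """Label an utterance as emphasized when its score reaches score_threshold."""
    scores = emphasis_scores(f0_by_utterance, frame_threshold)
    return [s >= score_threshold for s in scores]


if __name__ == "__main__":
    # Hypothetical F0 tracks (Hz) for three utterances by the same speaker.
    speaker_utterances = [
        [100, 0, 105, 98, 102],
        [101, 99, 0, 100, 103],
        [180, 190, 0, 185, 100],
    ]
    print(detect_emphasis(speaker_utterances))
```

Normalizing per speaker rather than over the whole meeting keeps a naturally high-pitched voice from being flagged as emphasized on every utterance. In practice the F0 tracks would come from a pitch estimator, and the thresholds would be tuned against hand-labeled data.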

Daniel P. W. Ellis | Lyndon Kennedy

[1] Barry Arons. Pitch-based emphasis detection for segmenting speech recordings, 1994, ICSLP.

[2] Richard T. Cauldwell. Where did the anger go? The role of context in interpreting emotion in speech, 2000.

[3] Andreas Stolcke, et al. Multispeaker speech activity detection for the ICSI meeting recorder, 2001, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01).

[4] Andreas Stolcke, et al. The Meeting Project at ICSI, 2001, HLT.

[5] Hideki Kawahara, et al. Comparative evaluation of F0 estimation algorithms, 2001, INTERSPEECH.

[6] Hideki Kawahara, et al. YIN, a fundamental frequency estimator for speech and music, 2002, The Journal of the Acoustical Society of America.

[7] Andreas Stolcke, et al. Prosody-based automatic detection of annoyance and frustration in human-computer dialog, 2002, INTERSPEECH.

[8] Elizabeth Shriberg, et al. Spotting "hot spots" in meetings: human judgments and prosodic cues, 2003, INTERSPEECH.