Pitch-based emphasis detection for characterization of meeting recordings

The automatic extraction of key utterances from spoken data has emerged as an interesting and difficult topic in automatic speech recognition. "Emphasis" or "excitement" may be a useful cue for identifying these utterances of interest. We undertake the task of reliably and automatically identifying emphasized or excited utterances in natural speech recorded in a meeting setting. We first seek to establish reliable ground-truth emphasis labels using several human labelers; the results show that human listeners can reliably identify emphasized utterances in meeting recordings. We then build an automatic emphasis detection system that uses normalized pitch as its only acoustic predictor. This pitch-based scheme distinguishes emphasized from non-emphasized utterances with an accuracy of 92% when ambiguous cases are excluded, a rate comparable to human interlabeler agreement.
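
To make the idea of a pitch-only detector concrete, the following is a minimal sketch and not the system described above. The abstract does not state how pitch is normalized or how the decision is made, so the sketch assumes a per-speaker z-score normalization of fundamental frequency (F0) and a simple threshold on the peak normalized value. All function names, the threshold of 2.0, and the convention of marking ambiguous utterances with None are hypothetical.

```python
from statistics import mean, pstdev


def speaker_stats(baseline_f0):
    """Return (mean, std) of a speaker's voiced F0 samples in Hz."""
    return mean(baseline_f0), pstdev(baseline_f0)


def normalize_pitch(f0_values, speaker_mean, speaker_std):
    """Z-score F0 samples against the speaker's own statistics (assumed normalization)."""
    if speaker_std <= 0:
        raise ValueError("speaker_std must be positive")
    return [(f0 - speaker_mean) / speaker_std for f0 in f0_values]


def is_emphasized(f0_values, speaker_mean, speaker_std, threshold=2.0):
    """Hypothetical rule: flag an utterance whose peak normalized pitch exceeds the threshold."""
    if not f0_values:
        return False
    z_scores = normalize_pitch(f0_values, speaker_mean, speaker_std)
    return max(z_scores) > threshold


def accuracy_excluding_ambiguous(predictions, labels):
    """Accuracy over utterances with a clear label; a label of None marks an ambiguous case."""
    scored = [(p, l) for p, l in zip(predictions, labels) if l is not None]
    if not scored:
        raise ValueError("no unambiguous labels to score")
    return sum(p == l for p, l in scored) / len(scored)
```

Scoring only the utterances with a clear label mirrors the evaluation condition stated above, where accuracy is reported with ambiguous cases excluded.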