Perceptual audio features for unsupervised key-phrase detection

We propose a new type of audio feature (HFCC-ENS) and an unsupervised method for detecting short sequences of spoken words (key-phrases) within long speech recordings. Our technical contributions are threefold. First, we use bandwidth-adapted filterbanks instead of classical MFCC-style filters in the feature extraction step. Second, we adapt the time resolution of the resulting features to the temporal characteristics of the spoken phrases. Third, we detect key-phrases by matching the HFCC-ENS feature sequence of a query phrase against the feature sequence extracted from a target speech recording. We evaluate the proposed method on the German Kiel Corpus and further investigate speech-related properties of the proposed feature.
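
The last two steps (adapting the time resolution and matching feature sequences) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Hann smoothing window, the downsampling factor, and the cosine-based cost are all assumptions chosen for illustration, and the filterbank stage is skipped entirely, so the input is simply assumed to be a non-negative matrix of per-frame band energies.

```python
import numpy as np


def smooth_and_downsample(features, window_len=5, hop=2):
    """Coarsen the time resolution of a (num_frames, num_bands) feature matrix.

    Assumed recipe (illustrative only): smooth each band over time with a
    Hann window, keep every `hop`-th frame, then scale each frame to unit
    Euclidean norm. Expects num_frames >= window_len.
    """
    window = np.hanning(window_len + 2)[1:-1]  # drop the two zero endpoints
    window = window / window.sum()
    smoothed = np.stack(
        [np.convolve(features[:, band], window, mode="same")
         for band in range(features.shape[1])],
        axis=1,
    )
    reduced = smoothed[::hop]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)  # guard against silent frames


def matching_curve(query, target):
    """Slide the query over the target and return one cost per start frame.

    The cost is 1 minus the mean frame-wise cosine similarity, so values
    near 0 mark positions where the target resembles the query.
    Both inputs must contain unit-norm rows.
    """
    n, m = len(query), len(target)
    if n > m:
        raise ValueError("query must not be longer than target")
    costs = np.empty(m - n + 1)
    for start in range(m - n + 1):
        similarities = np.sum(query * target[start:start + n], axis=1)
        costs[start] = 1.0 - similarities.mean()
    return costs
```

In this sketch, local minima of the returned curve that fall below a chosen threshold would be reported as candidate key-phrase occurrences. A larger `window_len` and `hop` trade temporal detail for robustness to variations in speaking rate.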
