The CUHK Spoken Web Search System for MediaEval 2013

This paper describes an audio keyword detection system developed at the Chinese University of Hong Kong (CUHK) for the Spoken Web Search (SWS) task of MediaEval 2013. The system was built using only the provided unlabeled data, and each query term was represented by a single query example (from the basic set for the required runs). The system follows the posteriorgram-based template-matching framework: a tokenizer converts the speech data into posteriorgrams, and dynamic time warping (DTW) is then applied for keyword detection. Its main features are: 1) a new approach to tokenizer construction based on Gaussian component clustering (GCC), and 2) query expansion based on pitch-synchronous overlap-and-add (PSOLA). On the SWS2013 Evaluation set, the system achieves an MTWV of 0.306 and an ATWV of 0.304.
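To make the matching framework concrete, the following is a minimal sketch of posteriorgram-based DTW scoring, assuming a standard negative-log inner-product frame distance; the function names and the length normalization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def frame_distance(p, q):
    # Common posteriorgram frame distance: -log of the inner product of
    # two posterior distributions (clipped to avoid log(0)).
    return -np.log(max(np.dot(p, q), 1e-10))

def dtw_cost(query, segment):
    """Length-normalized DTW alignment cost between two posteriorgram
    sequences; each row is one frame's posterior distribution.
    A lower cost indicates a more likely keyword occurrence."""
    n, m = len(query), len(segment)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], segment[j - 1])
            # Standard DTW recursion over insertion, deletion, and match.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```

In a query-by-example detector, this cost would be computed between the query posteriorgram and candidate segments of each utterance, with low-cost segments reported as detections.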
