A robust keyword spotting system for Persian conversational telephone speech using feature and score normalization and ARMA filter

Keyword spotting (KWS) refers to the detection of a limited number of given keywords in speech utterances. In this paper, we evaluate a robust keyword spotting system based on hidden Markov models for speaker-independent Persian conversational telephone speech. The performance of the baseline keyword spotter is improved by normalizing features using cepstral mean and variance normalization (CMVN) and cepstral gain normalization (CGN). Further improvement is obtained by applying an auto-regressive moving average (ARMA) filter to the normalized features. Experimental results show that although all of these methods improve keyword spotting performance, CMVN followed by ARMA filtering (MVA processing) of PLP features works considerably better on our Persian conversational telephone speech database: a 41% improvement over the baseline system is achieved at a false alarm (FA) rate of 8.6 FA/KW/hour.
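
The MVA processing mentioned above (mean and variance normalization followed by ARMA filtering) can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes per-utterance statistics, a commonly used form of the ARMA smoother that averages the previous `order` filtered frames with the current and next `order` normalized frames, and edge frames that are left unfiltered. The function names, the default `order=2`, and the `eps` guard are illustrative choices, not values taken from the paper.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features: array of shape (num_frames, num_coeffs), e.g. PLP cepstra.
    eps is an assumed guard against division by zero.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


def arma_filter(features, order=2):
    """ARMA smoothing along the time axis (assumed form).

    y[t] = (y[t-order] + ... + y[t-1] + x[t] + ... + x[t+order]) / (2*order + 1)
    The first and last `order` frames are copied through unchanged.
    """
    num_frames = features.shape[0]
    out = features.astype(float).copy()
    for t in range(order, num_frames - order):
        past = out[t - order:t].sum(axis=0)                # already filtered frames
        future = features[t:t + order + 1].sum(axis=0)     # current and upcoming inputs
        out[t] = (past + future) / (2 * order + 1)
    return out


def mva(features, order=2):
    """CMVN followed by ARMA filtering."""
    return arma_filter(cmvn(features), order=order)
```

For example, `mva(plp)` applied to a `(num_frames, 13)` array of PLP coefficients returns an array of the same shape, normalized per coefficient and then smoothed over time.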
