An utterance recognition technique for keyword spotting by fusion of Bark energy and MFCC features

This paper describes preliminary results for a keyword spotting system that uses a fusion of spectral and cepstral features. Spectral energy in 16 frequency bands on the Bark scale and 16 mel-scale warped cepstral coefficients are used independently, and in combination with appropriate weights, for recognizing word utterances. Results of matching features using Euclidean and cosine distances in a dynamic time warping (DTW) process demonstrate that cosine distance works better for Bark energy features, while weighted Euclidean distance brings out the closeness of utterances in the cepstral domain. In both cases, DTW achieves an accuracy better than 81 percent for different speakers, while fusion of the two feature sets raises it to over 86 percent. Both figures are based on a small subset of utterances from the Call Home database.
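
The matching scheme summarized above can be illustrated with a minimal sketch. It assumes each utterance has already been converted to two frame-by-feature arrays (16 Bark band energies and 16 cepstral coefficients per frame). The function names, the length normalization of the DTW cost, the per-coefficient weights `mfcc_weights`, and the fusion weight `alpha` are all assumptions made for illustration; the abstract does not specify how the weights were chosen or how the two costs were combined. In practice the two costs lie on different scales and would likely need normalization before fusion.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # convention chosen here for all-zero frames
    return 1.0 - float(np.dot(a, b)) / denom

def weighted_euclidean_dist(a, b, w):
    """Euclidean distance with per-coefficient weights w (hypothetical values)."""
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

def dtw_cost(x, y, dist):
    """Length-normalized DTW cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            # Allowed steps: vertical, horizontal, diagonal.
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def fused_cost(bark_x, bark_y, mfcc_x, mfcc_y, mfcc_weights, alpha=0.5):
    """Weighted sum of the Bark (cosine) and cepstral (weighted Euclidean) DTW costs.

    alpha is an assumed fusion weight, not a value taken from the paper.
    """
    c_bark = dtw_cost(bark_x, bark_y, cosine_dist)
    c_mfcc = dtw_cost(
        mfcc_x, mfcc_y,
        lambda a, b: weighted_euclidean_dist(a, b, mfcc_weights),
    )
    return alpha * c_bark + (1.0 - alpha) * c_mfcc
```

Under these assumptions, a test utterance would be assigned the keyword whose reference template yields the smallest fused cost.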