Recognizing isolated words with minimum distance similarity metric padding

Automated recognition of human speech commands under unconstrained, noisy conditions and with a limited number of training samples is a challenging problem of interest for smart devices and systems. In practice, it is impossible to remove noise without losing class-discriminative information in the speech signals. In addition, any attempt to improve signal quality places a further burden on the computational capacity of state-of-the-art speech command recognition systems. In this paper, we propose a low-level word processing system that uses mean-variance normalised time-frequency spectrograms and a new similarity measure that compensates for feature length mismatches, such as those resulting from pronunciation variations in speech segments. We find that padding a local similarity matrix with zero similarity values, so that a mismatch in the length of speech spectrograms is disregarded, improves word recognition accuracy and reduces between-class non-discriminative signals. Compared with state-of-the-art approaches to spectrogram comparison such as dynamic time warping (DTW), the proposed method, when tested on the TIMIT database, shows improved recognition accuracy, robustness to noise, lower computational requirements, and scalability to large-vocabulary problems.
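The pipeline summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the local similarity function, so the exponential form `exp(-|a - b|)` is an assumption, as are the per-frequency-bin normalisation, the left alignment of the two spectrograms, and all function names. Spectrograms are assumed to be NumPy arrays of shape `(n_freq, n_frames)`.

```python
import numpy as np


def mean_variance_normalise(spec, eps=1e-8):
    """Scale each frequency bin (row) to zero mean and unit variance over time."""
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)


def local_similarity(a, b):
    """Element-wise similarity in (0, 1]; the exponential form is an assumption."""
    return np.exp(-np.abs(a - b))


def padded_similarity_score(test, template):
    """Mean of the local similarity matrix after zero-padding it to the
    length of the longer spectrogram (hypothetical helper).

    Frames without a counterpart in the other spectrogram contribute zero
    similarity, so a length mismatch lowers the score without any warping.
    """
    if test.shape[0] != template.shape[0]:
        raise ValueError("spectrograms must have the same number of frequency bins")
    overlap = min(test.shape[1], template.shape[1])
    longest = max(test.shape[1], template.shape[1])
    sim = np.zeros((test.shape[0], longest))
    sim[:, :overlap] = local_similarity(test[:, :overlap], template[:, :overlap])
    return float(sim.mean())


def recognise(test, templates):
    """Return the label of the template with the highest padded similarity."""
    test_n = mean_variance_normalise(test)
    scores = {
        word: padded_similarity_score(test_n, mean_variance_normalise(tmpl))
        for word, tmpl in templates.items()
    }
    return max(scores, key=scores.get)
```

Unlike DTW, this comparison makes a single pass over the overlapping frames and needs no alignment search, which is consistent with the lower computational requirements claimed above; whether it reproduces the reported accuracies depends on details the abstract does not give.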
