Using Pitch as Prior Knowledge in Template-Based Speech Recognition

In a previous paper on speech recognition, we showed that templates can capture the dynamics of the speech signal better than parametric models such as hidden Markov models. The key step in template-matching approaches is finding the templates most similar to the test utterance. Traditionally, this selection is driven by a distortion measure on the acoustic features. In this work, we propose to improve template selection by using meta-linguistic information as prior knowledge. Similarity then depends not only on acoustic features but also on other sources of information present in the speech signal. Results on a continuous digit recognition task confirm that similarity between words does not depend on acoustic features alone: we obtained a 24% relative improvement over the baseline. Interestingly, results are better even when compared to a system with no prior information but a larger number of templates.
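To make the idea of prior-guided template selection concrete, the following is a minimal sketch and not the system evaluated here. It assumes that each template stores a single mean pitch value, that the prior acts as a hard gate on the pitch difference, and that the acoustic distortion is a plain dynamic time warping (DTW) cost with Euclidean frame distances. The names `select_templates`, `recognize`, and `max_pitch_gap_hz`, as well as the 30 Hz default, are hypothetical and chosen only for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two acoustic feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, template):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(test), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], template[j - 1])
            # Allow insertion, deletion, or a diagonal match step.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def select_templates(test_pitch, templates, max_pitch_gap_hz):
    """Assumed prior: keep templates whose mean pitch is near the test pitch."""
    return [t for t in templates if abs(t["pitch"] - test_pitch) <= max_pitch_gap_hz]


def recognize(test_frames, test_pitch, templates, max_pitch_gap_hz=30.0):
    """Return the label of the closest template among the pitch-gated candidates."""
    candidates = select_templates(test_pitch, templates, max_pitch_gap_hz)
    if not candidates:
        # Fall back to purely acoustic selection when the gate rejects everything.
        candidates = templates
    best = min(candidates, key=lambda t: dtw_distance(test_frames, t["frames"]))
    return best["label"]
```

In this sketch a template is a dictionary with `label`, `pitch`, and `frames` keys. Setting `max_pitch_gap_hz` very large disables the gate and reduces the selection to the traditional acoustic-only case, which makes the two behaviours easy to compare on toy data.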
