Phonological feature based variable frame rate scheme for improved speech recognition

In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high-level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing cannot accurately reflect the dynamics of continuous speech. In contrast, the proposed VFR scheme adapts the temporal representation of the speech signal by tying the framing strategy to the detected phone class sequence. The phone classes are detected and segmented using appropriately trained phonological features (PFs). In this manner, the proposed scheme tracks the evolution of speech driven by the underlying phonetic content and exploits the non-uniform information flow rate of speech through a variable framing strategy. The new VFR scheme is applied to automatic speech recognition on the TIMIT and NTIMIT corpora, where it is compared with a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results, with relative reductions in word error rate (WER) of 24% and 8% for the TIMIT and NTIMIT tasks, respectively.
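
The idea of tying the framing strategy to the detected phone class sequence can be illustrated with a minimal sketch. It assumes the phone class segmentation is already available as (start, end, class) triples. The class names, the window lengths, the frame shifts, and the function name `variable_frames` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from typing import List, Tuple

# Hypothetical (window length, frame shift) pairs in milliseconds per broad
# phone class. These values are illustrative assumptions, not the paper's.
FRAMING_MS = {
    "vowel": (25.0, 10.0),      # slowly varying: long window, coarse shift
    "fricative": (20.0, 5.0),
    "stop": (10.0, 2.5),        # rapid transients: short window, fine shift
    "silence": (25.0, 20.0),    # little information: sparse frames
}

Segment = Tuple[float, float, str]  # (start_ms, end_ms, phone_class)
Frame = Tuple[float, float, str]    # (frame_start_ms, frame_end_ms, phone_class)


def variable_frames(segments: List[Segment]) -> List[Frame]:
    """Place analysis frames inside each segment using class-specific framing."""
    frames: List[Frame] = []
    for seg_start, seg_end, phone_class in segments:
        win, shift = FRAMING_MS[phone_class]
        # Count frames by index to avoid accumulating floating-point error.
        n_frames = math.ceil((seg_end - seg_start) / shift)
        for i in range(n_frames):
            t = seg_start + i * shift
            frames.append((t, t + win, phone_class))
    return frames


if __name__ == "__main__":
    # A 100 ms vowel followed by a 20 ms stop (made-up segmentation).
    example = [(0.0, 100.0, "vowel"), (100.0, 120.0, "stop")]
    for frame in variable_frames(example):
        print(frame)
```

With these assumed settings, the 100 ms vowel receives 10 frames and the 20 ms stop receives 8, so the frame density follows the phone class rather than a single fixed rate.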
