VTLN based on the linear interpolation of contiguous mel filter-bank energies

This paper describes a novel feature-space VTLN method that models frequency warping as a linear interpolation of contiguous Mel filter-bank energies. The presented technique aims to reduce the distortion in the Mel filter-bank energy estimation that arises, when the central frequency of the band-pass filters is shifted, from the harmonic composition of voiced speech intervals and from DFT sampling. The presented interpolated filter-bank energy-based VTLN leads to relative reductions in WER as high as 11.2% and 7.6% when compared with the baseline system and standard VTLN, respectively, in a medium-vocabulary continuous speech recognition task. This new scheme also provides a significant relative reduction in WER of 7% when compared with state-of-the-art VTLN methods based on linear transforms in the cepstral space. The warping factor estimated here depends more on the speaker and less on the acoustic-phonetic content than the warping factor in state-of-the-art VTLN techniques.
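
The core idea, warping by interpolating between neighbouring filter-bank energies rather than moving the filters themselves, can be illustrated with a minimal sketch. The abstract does not give the paper's actual parameterization, so the function name `interpolate_filterbank` and the `shift` parameter (a fractional displacement in units of filter index) are assumptions made for illustration only and should not be read as the authors' formulation.

```python
import numpy as np


def interpolate_filterbank(energies, shift):
    """Approximate a frequency warp by linear interpolation of
    contiguous filter-bank energies.

    energies: 1-D sequence, one energy per Mel filter, ordered by
              center frequency.
    shift:    hypothetical warping parameter, expressed as a fractional
              displacement in filter-index units (an assumption; the
              abstract does not define the warping factor).
    """
    energies = np.asarray(energies, dtype=float)
    idx = np.arange(len(energies))
    # Evaluate the piecewise-linear curve through (idx, energies) at the
    # shifted positions. np.interp holds the edge values for positions
    # that fall outside [0, len(energies) - 1].
    return np.interp(idx + shift, idx, energies)


# Toy example with four filter-bank energies.
fbank = np.array([1.0, 2.0, 4.0, 8.0])
warped = interpolate_filterbank(fbank, 0.25)
unchanged = interpolate_filterbank(fbank, 0.0)
```

Because each output is a weighted mix of two already-computed energies, no filter is re-centered on the DFT grid. In a typical VTLN setup the warping parameter would be selected per speaker, for instance by a likelihood-based search over candidate values, but that selection step is not shown here.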

Néstor Becerra Yoma | Claudio Garretón | Fernando Huenupán | Ignacio Catalan | Jorge Wuth
