Vocal tract length estimation for voiced and whispered speech using gammachirp filterbank

In this paper, we demonstrate that an auditory spectrogram based on a dynamic compressive gammachirp filterbank (GCFB) enables accurate and robust estimation of vocal tract length (VTL) for both voiced and whispered speech. Normalized VTLs of 21 speakers were derived by least-squares analysis of their pairwise VTL ratios (all ordered pairs, 21P2 = 21 × 20 = 420), which were estimated by minimizing spectral distances between the auditory spectrograms. The frequency range used in the distance calculation was restricted, and the range from 500 to 5000 Hz was the most reasonable for both speech modes. The method based on the GCFB performed better than the one based on the mel-frequency filterbank (MFFB). The estimated VTLs were compared with VTL data measured from MRI to confirm their reliability.
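
The least-squares step can be illustrated with a minimal sketch. It assumes, hypothetically, that each estimated ratio r_ij approximates L_i / L_j, that the fit is done in the log domain (log r_ij = x_i - x_j), and that the result is normalized so the geometric mean of the VTLs is 1. None of these choices are stated in the abstract, the function name `normalize_vtl` is invented for illustration, and the ratios below are synthetic rather than measured.

```python
import itertools
import numpy as np

def normalize_vtl(ratios, n_speakers):
    """Least-squares normalized VTLs from pairwise ratio estimates.

    ratios: dict mapping an ordered pair (i, j), i != j, to an estimate
            of L_i / L_j (assumed convention, not taken from the paper).
    Returns an array of normalized VTLs with geometric mean 1.
    """
    rows, rhs = [], []
    for (i, j), r in ratios.items():
        row = np.zeros(n_speakers)
        row[i], row[j] = 1.0, -1.0      # log r_ij = x_i - x_j
        rows.append(row)
        rhs.append(np.log(r))
    # Ratios fix the VTLs only up to a common scale factor, so add one
    # constraint row: the log-VTLs sum to zero (assumed normalization).
    rows.append(np.ones(n_speakers))
    rhs.append(0.0)
    x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.exp(x)

# Synthetic demonstration: 21 speakers, all 420 ordered pairs.
rng = np.random.default_rng(0)
n = 21
true_vtl = rng.uniform(13.0, 18.0, size=n)            # made-up lengths in cm
pairs = list(itertools.permutations(range(n), 2))     # 21P2 ordered pairs
ratios = {
    (i, j): true_vtl[i] / true_vtl[j] * np.exp(rng.normal(0.0, 0.02))
    for i, j in pairs                                  # 2% multiplicative noise
}
est = normalize_vtl(ratios, n)
expected = true_vtl / np.exp(np.mean(np.log(true_vtl)))
```

Working in the log domain turns the multiplicative ratio model into a linear one, so an ordinary least-squares solver suffices; a different normalization (for example, fixing one reference speaker to 1) would change only the constraint row.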
