Non-linguistic factors such as morphological differences in vocal tracts inevitably affect the acoustic features of speech. Recently, a new speech representation, called the structural representation, was proposed that is completely independent of these factors. This representation discards the absolute properties of speech events and captures and models only their relative properties. In previous studies, all discussion of this representation was based on cepstrum-based features. In this report, spectrum-based features are used for the structural representation and tested for speech recognition. Mathematical and experimental discussions show the following. 1) The spectrum-based structural representation also has strong speaker invariance. 2) It can outperform cepstrum-based structures in noisy speech recognition. 3) Its performance is rather similar to that of human listeners when noise-vocoded speech samples are tested. Finally, we discuss the validity of spectrum-based structural speech recognition as a model of human speech perception.
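The idea of keeping only relative properties can be illustrated with a small sketch. It assumes, beyond what the abstract states, that each speech event is modeled as a Gaussian with diagonal covariance and that the relation between two events is measured by the Bhattacharyya distance. A per-dimension affine map (hypothetical `scale` and `shift` values) stands in for a speaker difference. The event parameters are made-up illustrative numbers, not data from the report.

```python
import math

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                               # averaged variance
        total += 0.125 * (m1 - m2) ** 2 / v               # mean term
        total += 0.5 * math.log(v / math.sqrt(v1 * v2))   # variance term
    return total

def structure_matrix(events):
    """Pairwise distance matrix: the 'structure' of a set of events."""
    n = len(events)
    return [[bhattacharyya_diag(*events[i], *events[j]) for j in range(n)]
            for i in range(n)]

def apply_affine(events, scale, shift):
    """Map every feature x_k to scale[k] * x_k + shift[k] (scale[k] != 0)."""
    return [([a * m + b for a, b, m in zip(scale, shift, mu)],
             [a * a * v for a, v in zip(scale, var)])
            for mu, var in events]

# Illustrative events as (mean vector, variance vector); values are made up.
events = [([1.0, 0.5], [0.2, 0.3]),
          ([2.5, -1.0], [0.4, 0.1]),
          ([-0.5, 2.0], [0.3, 0.6])]

# Hypothetical "speaker difference": stretch and offset each dimension.
warped = apply_affine(events, scale=[1.7, -0.6], shift=[3.0, -2.0])

original = structure_matrix(events)
transformed = structure_matrix(warped)
max_gap = max(abs(original[i][j] - transformed[i][j])
              for i in range(len(events)) for j in range(len(events)))
print("largest change in any pairwise distance:", max_gap)
```

The means and variances of the individual events change under the map, while the pairwise distances agree up to floating-point error. Under the stated assumptions, this is the sense in which a purely relative description can be insensitive to such a transformation.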