Acoustic segmentation using switching state Kalman filter

This paper reports the segmentation of acoustic signals from the TIMIT database using a switching state Kalman filter model. On the assumption that the high-dimensional acoustic feature vector of LSFs (line spectrum frequencies) of the speech signal lies in or near a low-dimensional space, the continuous state vector in this model is two-dimensional. The model parameters are initialized by PPCA (probabilistic principal component analysis) and first-order vector autoregression, and are then re-estimated by the EM algorithm. We show that the model can classify speech segments into vowel, nasal, frication, and silence classes using approximate Viterbi inference.
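
The initialization step can be illustrated with a minimal sketch. It assumes the closed-form maximum-likelihood PPCA solution and an ordinary least-squares fit of a first-order vector autoregression on the resulting two-dimensional latent trajectory. The function names `ppca_init` and `var1_init`, the synthetic data standing in for LSF features, and the single-regime setting are all assumptions made for illustration. The abstract does not say how frames are assigned to discrete states, so the sketch fits one parameter set only, and it omits the EM re-estimation and the approximate Viterbi inference.

```python
import numpy as np

def ppca_init(Y, q=2):
    """Closed-form ML PPCA fit; Y has shape (T, d) with q < d."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    S = Yc.T @ Yc / Y.shape[0]                 # sample covariance (d, d)
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]            # re-sort to descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[q:].mean()                  # noise variance: mean of discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    M = W.T @ W + sigma2 * np.eye(q)
    X = np.linalg.solve(M, W.T @ Yc.T).T       # posterior means of the latent states (T, q)
    return mu, W, sigma2, X

def var1_init(X):
    """Least-squares fit of x_t = A x_{t-1} + w_t with w_t ~ N(0, Q)."""
    X_prev, X_next = X[:-1], X[1:]
    # lstsq solves X_prev @ B = X_next, so B is the transpose of A
    A = np.linalg.lstsq(X_prev, X_next, rcond=None)[0].T
    resid = X_next - X_prev @ A.T
    Q = resid.T @ resid / resid.shape[0]       # residual (process noise) covariance
    return A, Q

# Synthetic stand-in for a sequence of 10-dimensional LSF vectors (hypothetical data)
rng = np.random.default_rng(0)
T, d = 500, 10
A_true = np.array([[0.9, 0.1], [-0.1, 0.9]])
Z = np.zeros((T, 2))
for t in range(1, T):
    Z[t] = A_true @ Z[t - 1] + rng.normal(scale=0.1, size=2)
C = rng.normal(size=(d, 2))
Y = Z @ C.T + rng.normal(scale=0.05, size=(T, d))

mu, W, sigma2, X = ppca_init(Y, q=2)
A, Q = var1_init(X)
print(W.shape, A.shape, Q.shape)
```

In a switching model, one such set of parameters (observation matrix, transition matrix, noise covariances) would be held per discrete state, with these values serving only as starting points for EM.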
