This paper presents a new approach to formant tracking that uses a parameter-free non-linear predictor to map formant frequencies and bandwidths into the acoustic feature space. The approach decomposes the speech signal into two components: the first captures the mapping between formants and acoustic observations, and the second captures the residual in the signal. We build the mapping by quantizing the formant space and creating a predictor codebook. Formant tracking then proceeds in two steps: (1) expectation-maximization (EM) training of the parameters of the residual component, and (2) a search of the predictor codebook for the best formant values. We explore both maximum a posteriori (MAP) and minimum mean squared error (MMSE) estimation for formant tracking with the proposed approach. We also impose first-order continuity constraints on the formant trajectories and use Viterbi search to find the best trajectory. We present formant tracking results on data from the Switchboard corpus.
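The codebook search described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the uniform prior over codebook entries, and the squared-difference continuity penalty are all assumptions made for illustration. The per-frame log-likelihood of each codebook entry is taken as given, since the abstract does not specify how the predictor and the residual component produce it.

```python
import math


def map_estimate(log_scores, codebook):
    """MAP-style estimate for one frame: the codebook entry with the highest score.

    Assumes a uniform prior, so the highest log-likelihood is also the
    highest posterior.
    """
    best = max(range(len(codebook)), key=lambda k: log_scores[k])
    return codebook[best]


def mmse_estimate(log_scores, codebook):
    """MMSE-style estimate for one frame: posterior-weighted mean of the entries."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]  # shift for numerical stability
    total = sum(weights)
    dims = len(codebook[0])
    return tuple(
        sum(w * entry[d] for w, entry in zip(weights, codebook)) / total
        for d in range(dims)
    )


def viterbi_track(frame_scores, codebook, continuity_weight=1.0):
    """Best codebook index per frame under a first-order continuity penalty.

    frame_scores[t][k] is the log-likelihood of entry k at frame t.
    The penalty (an assumption) is the squared frame-to-frame change in
    formant frequency, measured in kHz and scaled by continuity_weight.
    """
    n_codes = len(codebook)

    def penalty(i, j):
        return continuity_weight * sum(
            ((a - b) / 1000.0) ** 2 for a, b in zip(codebook[i], codebook[j])
        )

    best = list(frame_scores[0])
    backpointers = []
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for j in range(n_codes):
            prev = max(range(n_codes), key=lambda i: best[i] - penalty(i, j))
            new_best.append(best[prev] - penalty(prev, j) + scores[j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    k = max(range(n_codes), key=lambda i: best[i])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


# Hypothetical three-entry codebook of (F1, F2) values in Hz.
codebook = [(500.0, 1500.0), (700.0, 1200.0), (300.0, 2300.0)]
frame_scores = [[0.0, -5.0, -5.0], [-1.0, -5.0, -0.9], [0.0, -5.0, -5.0]]
smooth_path = viterbi_track(frame_scores, codebook, continuity_weight=1.0)
free_path = viterbi_track(frame_scores, codebook, continuity_weight=0.0)
```

In this toy input the middle frame slightly favors a distant codebook entry. With the continuity penalty switched off the path jumps to that entry and back, while with the penalty on the path stays on the first entry throughout, which is the smoothing effect the continuity constraint is meant to provide.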