SPEECH CODING USING MIXTURE OF GAUSSIANS POLYNOMIAL MODEL

Parham Zolfaghari    †Tony Robinson

CREST/ATR Human Information Processing Research Labs, Kyoto 619-02, Japan
email: zparham@hip.atr.co.jp
†Cambridge University Engineering Department, Cambridge CB2 1PZ, UK
email: ajr@eng.cam.ac.uk

ABSTRACT

We have investigated a novel method of spectral estimation based on mixture of Gaussians in a sinusoidal analysis and synthesis framework. After quantisation of this parametric scheme a fixed frame-rate coder operating at a bit-rate of around 2.4 kbits/s has been developed. This paper describes an extension to this spectral model based on constraining the parameters of the mixture of Gaussians to be on a polynomial trajectory over a segment of speech data. This is referred to as the mixture of Gaussians polynomial model (MGPM). In order to realise a segmental coder, dynamic programming over the utterance is performed. The segmental representation of the spectra results in a log-likelihood score over a segment which is used as the cost function in the dynamic programming algorithm. Speech coding components such as pitch, voicing and gain are described segmentally. A number of segmental coders are presented with bit-rates in the range of 350 to 650 bits/s. These coders offer good and intelligible coded speech evaluated using DRT scoring at these bit-rates.

1. INTRODUCTION

A segmental framework employs the inter-frame or time dependence of the spectral representation. This dependence is inherent in various segments of speech, such as sustained vowels, as the speech spectral envelope is a slow time-varying process and spectra of adjacent frames are highly correlated.

Various forms of segmentation models have been applied to speech coding and recognition. In coding, Roucos et al. [11] describe a very low bit-rate segment vocoder operating at 150 bits/s for a single speaker. This low rate is achieved by vector quantisation (VQ) of all the LPC spectra in a segment as a single unit.
The Kang-Coulter 600 bits/s vocoder [6] also uses LPC methods followed by formant tracking to produce good quality speech with a reported DRT score of 79.9. These low bit-rates can also be achieved by a recognition-based approach where recognition units are coded. Holmes [5] has described a method which uses an underlying linear-trajectory formant model for both recognition and synthesis.

The contribution of this work is to model the envelope of the short-term power spectral density as a mixture of Gaussians [13]. In this framework a Gaussian roughly corresponds to a formant, with the Gaussian mean corresponding to the formant frequency and the variance corresponding to the bandwidth. This model was integrated in a sinusoidal model based speech coding scheme [14]. An advantage of this framework is that a speech segment may be modelled using a polynomial trajectory for the Gaussian means and variances. We have previously reported on a segmental coder using a linear polynomial trajectory for the Gaussian mixtures operating between 600-800 bits/s [15]. We extend this model to an R'th order polynomial to represent both means and variances of the Gaussians. In the speech recognition area, similar models have also been implemented for MFCC trajectories in an HMM-based system [4].

2. SEGMENTAL CODER STRUCTURE

The block structure of the coders described in this paper is as shown in Figure 1. A sinusoidal model framework based on the ideas of McAulay and Quatieri [8] is used. In this model, the speech signal is represented by a harmonic set of partials with varying amplitudes and frequencies. In accordance with our desire to build a very low bit-rate coder we restrict the sine waves to be harmonically related. The inverse FFT method of re-synthesis [3] is used and the phase of each harmonic is chosen at reconstruction time to minimise the mismatch with the previous frame.

The Spectral Envelope Estimation Vocoder (SEEVOC) envelope, devised by Paul [9], uses a robust peak detection algorithm to yield a smooth envelope as the underlying spectral representation. In order to operate in the low bit-rate region, the SEEVOC envelope needs to be efficiently coded. We use the mixture of Gaussians polynomial model to represent these spectra over a segment. Polynomial least squares
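
As a rough illustration of the polynomial trajectory idea described above, the sketch below fits polynomials to per-frame Gaussian means and variances by ordinary least squares and then evaluates the resulting mixture envelope at one point in the segment. It is a minimal sketch under stated assumptions and is not the estimation procedure of this paper: the function names (`fit_trajectory`, `eval_trajectory`, `mixture_envelope`), the normalised time axis in [0, 1], the use of NumPy, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_trajectory(values, order):
    """Least-squares polynomial fit, one polynomial per Gaussian component.

    values: array of shape (T, M), one parameter (mean or variance)
            for M Gaussians over T frames of a segment.
    Returns coefficients of shape (order + 1, M), highest power first.
    Assumption: frames are equally spaced on a normalised time axis [0, 1].
    """
    values = np.asarray(values, dtype=float)
    t = np.linspace(0.0, 1.0, values.shape[0])
    return np.polyfit(t, values, order)


def eval_trajectory(coeffs, t):
    """Evaluate every component's polynomial at normalised time t."""
    return np.array([np.polyval(coeffs[:, m], t) for m in range(coeffs.shape[1])])


def mixture_envelope(freqs, weights, means, variances):
    """Weighted sum of Gaussian densities evaluated on a frequency grid."""
    f = np.asarray(freqs, dtype=float)[:, None]          # shape (F, 1)
    norm = np.sqrt(2.0 * np.pi * variances)              # shape (M,)
    dens = np.exp(-0.5 * (f - means) ** 2 / variances) / norm
    return dens @ weights                                # shape (F,)


# Toy segment (hypothetical values): 5 frames, 2 Gaussians ("formants").
t = np.linspace(0.0, 1.0, 5)
frame_means = np.stack([500.0 + 200.0 * t, 1500.0 - 300.0 * t], axis=1)     # Hz
frame_vars = np.stack([np.full(5, 80.0 ** 2), np.full(5, 120.0 ** 2)], axis=1)  # Hz^2
weights = np.array([0.6, 0.4])

# Linear (order 1) trajectories for both means and variances.
mean_coeffs = fit_trajectory(frame_means, order=1)
var_coeffs = fit_trajectory(frame_vars, order=1)

# Reconstruct the envelope at the middle of the segment.
mid_means = eval_trajectory(mean_coeffs, 0.5)
mid_vars = eval_trajectory(var_coeffs, 0.5)
freqs = np.linspace(0.0, 4000.0, 401)
envelope = mixture_envelope(freqs, weights, mid_means, mid_vars)
```

With this representation a segment is described by (order + 1) coefficients per parameter per Gaussian rather than one value per frame. Note that an unconstrained polynomial fitted to variances can become non-positive; a practical implementation would need to guard against that.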
[1] Kuldip K. Paliwal et al. Model parameter estimation for mixture density polynomial segment models. Comput. Speech Lang., 1998.
[2] Richard M. Schwartz et al. A segment vocoder at 150 b/s. ICASSP, 1983.
[3] Parham Zolfaghari et al. A formant vocoder based on mixtures of Gaussians. 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997.
[4] Thomas F. Quatieri et al. Magnitude-only reconstruction using a sinusoidal speech model. ICASSP, 1984.
[5] George S. Kang et al. 600-Bit-Per-Second Voice Digitizer (Linear Predictive Formant Vocoder). 1976.
[6] Heekuck Oh et al. Neural Networks for Pattern Recognition. Adv. Comput., 1993.
[7] Parham Zolfaghari et al. A segmental formant vocoder based on linearly varying mixture of Gaussians. EUROSPEECH, 1997.
[8] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper. 1977.
[9] D. Paul. The spectral envelope estimation vocoder. 1981.
[10] Parham Zolfaghari et al. Formant analysis using mixtures of Gaussians. Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), 1996.
[11] Steve Renals et al. WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.