Acoustic modeling for spontaneous speech recognition using syllable-dependent models

This paper proposes a syllable-context-dependent model for spontaneous speech recognition. It is generally assumed that, since spontaneous speech is greatly affected by coarticulation, an acoustic model featuring a longer-range phonemic context is required to achieve a high degree of recognition accuracy. This motivated the authors to investigate a tri-syllable model that takes differences in the preceding and succeeding syllables into account. Since Japanese syllables consist of either a single vowel or a consonant-vowel combination, a tri-syllable model always takes into account the preceding and succeeding vowels, which are the primary factors in coarticulation. A tri-syllable model is thus capable of efficiently representing coarticulation. The tri-syllable model was trained on spontaneous speech, and its effectiveness was then evaluated on continuous syllable recognition and on statistical-language-model-based continuous word recognition. Compared to a regular triphone model without state sharing, the correct syllable accuracy of continuous syllable recognition improved from 64.9% to 66.3%, and the word recognition accuracy of statistical-language-model-based continuous word recognition improved from 88.4% to 89.2%.
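As an illustration of the context expansion the abstract describes, the sketch below shows how a Japanese syllable sequence might be expanded into tri-syllable units, by analogy with triphone expansion. This is a minimal sketch, not code from the paper: the function name, the "left-center+right" notation (borrowed from HTK-style triphone labels), and the use of "sil" as a sentence-boundary context are all illustrative assumptions.

```python
# A minimal sketch (not from the paper) of expanding a Japanese syllable
# sequence into context-dependent tri-syllable units. Each unit is the
# center syllable annotated with its left and right neighbors, written
# here in HTK-style "left-center+right" notation (an assumption; the
# paper's actual unit labeling may differ).

def to_trisyllables(syllables):
    """Expand a syllable sequence into tri-syllable context units."""
    units = []
    for i, center in enumerate(syllables):
        left = syllables[i - 1] if i > 0 else "sil"                    # sentence-initial context
        right = syllables[i + 1] if i < len(syllables) - 1 else "sil"  # sentence-final context
        units.append(f"{left}-{center}+{right}")
    return units

# Example: the word "sakana" (fish) as syllables /sa/ /ka/ /na/.
# Because every Japanese syllable ends in a vowel, the left-neighbor
# syllable always supplies the preceding vowel context, which is why a
# tri-syllable unit inherently captures vowel-driven coarticulation.
print(to_trisyllables(["sa", "ka", "na"]))
# ['sil-sa+ka', 'sa-ka+na', 'ka-na+sil']
```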