Digit recognition with stochastic perceptual speech models
We have recently developed a statistical model of speech that focuses statistical modeling power on phonetic transitions. These are the perceptually dominant and information-rich portions of the speech signal, which may also be the parts of the speech signal with a better chance to withstand adverse acoustical conditions. We describe here some of the concepts, along with some preliminary experiments on digit recognition. These experiments show that the new models, when used in combination with our more standard models, can significantly improve performance in the presence of noise.

1. BACKGROUND

In [5] we reported the development of a statistical model of speech that incorporates some simple temporal properties of speech perception. The primary goal of this theoretical development was to avoid a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. In the new model, speech was viewed from the perceiving side as a sequence of Auditory Events (Avents), which are elementary decisions that occur at some point when the spectrum and amplitude are rapidly changing (as in [3]). Avents are presumed to occur about once per phone boundary, and thus are modeled as being separated by relatively stationary periods (ca. 50-150 ms).

The statistical model uses these Avents as fundamental building blocks for words and utterances, separated by states corresponding to the more stationary regions. In order to focus the statistical power on the rapidly changing portions of the time series, all of the stationary regions are tied to the same non-Avent class. Markov-like recognition models use Avents as time-asynchronous observations. Discriminant models are trained to distinguish among all classes (including the non-Avent class).
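The tying of stationary regions to one shared class can be illustrated with a small sketch. It is not the authors' implementation: the Avent labels for "six", the `NON_AVENT` name, the helper functions, and the choice to begin and end each word with a non-Avent state are all assumptions made for illustration.

```python
from typing import Dict, List

# Single class shared by every stationary region (name is hypothetical).
NON_AVENT = "non-avent"


def build_word_model(avents: List[str]) -> List[str]:
    """Return a left-to-right state sequence for one word.

    Each Avent is surrounded by the tied non-Avent state. Starting and
    ending on a non-Avent state is an assumption of this sketch.
    """
    states = [NON_AVENT]
    for avent in avents:
        states.extend([avent, NON_AVENT])
    return states


def class_inventory(lexicon: Dict[str, List[str]]) -> List[str]:
    """Collect the classes a discriminant model would have to separate."""
    classes = {NON_AVENT}
    for avents in lexicon.values():
        classes.update(avents)
    return sorted(classes)


# Hypothetical Avent labels, one per assumed phone boundary in "six".
lexicon = {"six": ["sil>s", "s>ih", "ih>k", "k>s", "s>sil"]}
six_model = build_word_model(lexicon["six"])
```

In this sketch the word model has eleven states but only six distinct classes, because every stationary state maps to the same non-Avent class; the classifier's capacity is spent on the transitions.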
In the full embedded procedure, the training data is automatically aligned using dynamic programming, and the discriminant system (e.g., a neural network) is trained on the new segmentation. These two steps are iterated, as discussed in [2], and are guaranteed to converge to a local minimum of the probability of error (on the training set). This process should focus modeling power on the perceptually dominant and information-rich portions of the speech signal, which may also be the parts of the speech signal with a better chance to withstand adverse acoustical conditions. We named this new framework the Stochastic Perceptual Auditory-event-based (Avent) Model, or SPAM. Figure 1 shows a SPAM for the word "six". Note that all of the states with …
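The alignment half of the embedded procedure can be sketched as a dynamic-programming forced alignment. This is a minimal illustration under stated assumptions, not the procedure from the paper: the frame-synchronous stay-or-advance topology, the `align` function, and the toy scores are all hypothetical, and the retraining of the discriminant system on the resulting labels is omitted.

```python
import math
from typing import Dict, List


def align(frame_logp: List[Dict[str, float]], states: List[str]) -> List[int]:
    """Return the best monotonic state index for each frame.

    frame_logp[t][c] is a log score for class c at frame t (for example,
    a log posterior from a classifier). At each frame the path either
    stays in its state or advances to the next one (an assumption).
    """
    n_frames, n_states = len(frame_logp), len(states)
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = frame_logp[0][states[0]]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + frame_logp[t][states[s]]

    if score[-1][-1] == neg_inf:
        raise ValueError("too few frames to visit every state")

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: one hypothetical Avent "A" between two non-Avent regions.
states = ["non-avent", "A", "non-avent"]
quiet = {"non-avent": math.log(0.9), "A": math.log(0.1)}
change = {"non-avent": math.log(0.1), "A": math.log(0.9)}
frames = [quiet, quiet, change, quiet, quiet]
path = align(frames, states)
labels = [states[i] for i in path]
```

In an embedded loop of the kind described above, `labels` would serve as the new per-frame targets for the next round of discriminant training, after which the data would be realigned with the updated scores.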
[1] Steven Greenberg et al. Stochastic perceptual auditory-event-based models for speech recognition, 1994, ICSLP.
[2] S. Furui. On the role of spectral transition for speech perception, 1986, The Journal of the Acoustical Society of America.
[3] Yochai Konig et al. REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities - Application to Transition-Based Connectionist Speech Recognition, 1995, NIPS.