A stochastic framework for articulatory speech recognition

One of the major difficulties in incorporating knowledge of speech production constraints into speech recognition lies in adequately characterizing the relationship between an articulatory description of speech and its lexical and acoustic counterparts, and in developing procedures for recovering such an articulatory description from acoustic input. In this paper, a new stochastic framework for articulatory speech recognition is presented that addresses these issues. Utterances are described in terms of overlapping gestural units that are built into a Markov state structure. Each gestural combination is identified with a set of acoustic/articulatory correlates embodied in a target distribution on an articulatory parameter space, while articulatory motion is represented by a stochastic linear dynamical system whose parameters and input are indexed on the Markov state. A piecewise-linear approximation to the articulatory-acoustic mapping, derived from an explicit production model, transforms the articulatory distribution into acoustic space. The key innovation of the model is that articulation is described probabilistically as the response of a system to a sequence of target distributions produced by a gestural sequence. The derivation of the articulatory-acoustic mapping is discussed, and algorithms for recognition and training, based on Kalman filtering and an approximate EM procedure, are proposed.
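As a reading aid, one plausible formalization consistent with this description is a switching linear-Gaussian state-space model. The notation below (articulatory state $x_t$, gestural Markov state $s_t$, articulatory target $u_{s_t}$, and the local linear piece $(H_{s_t}, b_{s_t})$ of the articulatory-acoustic map) is assumed for illustration and need not match the paper's own symbols:

$$
\begin{aligned}
x_t &= A_{s_t}\, x_{t-1} + B_{s_t}\, u_{s_t} + w_t, & w_t &\sim \mathcal{N}(0,\, Q_{s_t}),\\
y_t &= H_{s_t}\, x_t + b_{s_t} + v_t, & v_t &\sim \mathcal{N}(0,\, R_{s_t}).
\end{aligned}
$$

Here the first equation captures articulatory motion as the response of a linear dynamical system driven toward the current state's target, with $u_{s_t}$ drawn from the target distribution of the active gestural combination, and the second transforms the articulatory distribution into acoustic space through the piecewise-linear approximation to the articulatory-acoustic mapping.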
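Under the same assumed formalization, the likelihood of an acoustic observation sequence given one hypothesized gestural state sequence can be computed with a standard Kalman filter. The following is a minimal, hypothetical sketch; the function and parameter names are illustrative, not taken from the paper:

```python
# Hypothetical sketch: Kalman-filter log-likelihood of acoustics y under one
# fixed gestural state sequence, using the switching linear-Gaussian form
# sketched above. All names (A, B, Q, H, b, R, u) are illustrative assumptions.
import numpy as np

def log_likelihood(y, states, params, mu0, P0):
    """y: (T, d_y) acoustic frames; states: length-T gestural state indices;
    params[s] = (A, B, Q, H, b, R, u) for Markov state s;
    mu0, P0: prior mean and covariance of the articulatory state."""
    mu, P = mu0, P0
    ll = 0.0
    for t, s in enumerate(states):
        A, B, Q, H, b, R, u = params[s]
        # Predict: articulatory state driven toward the state's target u.
        mu = A @ mu + B @ u
        P = A @ P @ A.T + Q
        # Map the predicted articulatory distribution into acoustic space
        # through the local linear piece (H, b) of the articulatory-acoustic map.
        y_hat = H @ mu + b
        S = H @ P @ H.T + R                      # innovation covariance
        r = y[t] - y_hat                         # innovation
        ll += -0.5 * (r @ np.linalg.solve(S, r)
                      + np.linalg.slogdet(S)[1]
                      + len(r) * np.log(2 * np.pi))
        # Measurement update.
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        mu = mu + K @ r
        P = (np.eye(len(mu)) - K @ H) @ P
    return ll
```

In a full recognizer, such per-sequence scores would be combined with the Markov chain's state-sequence probability and maximized over candidate gestural sequences, while parameter re-estimation would follow the approximate EM procedure mentioned above.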