A new neural network for articulatory speech recognition and its application to vowel identification

Abstract

A system for automatic speech recognition (ASR) based on a new neural network design and a theory of articulatory phonology is presented. The system operates in two stages. In the first, a neural network maps speech acoustics onto the movements of the tongue and lips that produced them (the network is trained on X-ray microbeam recordings of actual articulatory movements); in the second, articulatory gestures, the basic units of articulatory phonology, are recovered from those movements. The neural network is built around a new objective function, Correlational + Scaling Error (COSE). Compared with a traditional neural network system, the COSE system trains faster, produces output that better captures the shape of the articulatory movements, and yields higher recognition rates for vowel gestures. After training on data from two speakers, the system achieved recognition rates of up to 96% on tokens from the training set and 87% on tokens produced by a novel speaker.
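The abstract names the COSE objective but does not spell out its formula. As a minimal sketch, assuming the objective combines a Pearson-correlation term (rewarding agreement in trajectory shape) with a scaling term penalizing amplitude mismatch, one plausible formulation follows; the function name cose_loss, the weight alpha, and the eps stabilizer are hypothetical, not taken from the paper:

```python
import numpy as np

def cose_loss(pred, target, alpha=0.5, eps=1e-8):
    """Hypothetical COSE-style objective for one articulator trajectory.

    Combines (1 - Pearson correlation), which measures shape mismatch,
    with a squared difference of standard deviations, which measures
    amplitude (scaling) mismatch. The paper's exact formulation may differ.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Correlational term: near 0 when the trajectories share the same shape.
    pc, tc = pred - pred.mean(), target - target.mean()
    corr = (pc @ tc) / (np.sqrt((pc @ pc) * (tc @ tc)) + eps)
    corr_err = 1.0 - corr

    # Scaling term: correlation is scale-invariant, so amplitude
    # mismatch must be penalized separately.
    scale_err = (pred.std() - target.std()) ** 2

    return corr_err + alpha * scale_err

# Example: correct shape but half the amplitude, so the correlation
# term stays near zero while the scaling term is nonzero.
t = np.linspace(0.0, 1.0, 100)
target = np.sin(2.0 * np.pi * t)   # reference articulator trajectory
pred = 0.5 * target                # same shape, wrong amplitude
print(cose_loss(pred, target))
```

The design intuition, under these assumptions, is that a purely correlational objective could match the shape of an articulator trajectory while getting its amplitude badly wrong; adding an explicit scaling term closes that gap, consistent with the abstract's claim that COSE output better captures the shape of the articulatory movements.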