Speech recognition using connectionist networks

The use of connectionist networks for speech recognition is assessed using a set of representative phonetic discrimination problems. The problems are chosen with respect to the physiological theory of phonetics in order to give broad coverage of the space of articulatory phonetics. A separate network solution is sought for each phonetic discrimination problem. A connectionist network model called the Temporal Flow Model is defined, which consists of simple processing units with single-valued outputs interconnected by links of variable weight. The model represents temporal relationships using delay links and permits general patterns of connectivity, including feedback. It is argued that the model has properties appropriate for time-varying signals such as speech. Methods for selecting network architectures for different recognition problems are presented. The architectures discussed include random networks, minimally structured networks, hand-crafted networks, and networks automatically generated from samples of speech data. Networks are trained by modifying their weight parameters so as to minimize the mean squared error between the actual and the desired response of the output units. The desired output unit response is specified by a target function. Training uses a second-order method of iterative nonlinear optimization by gradient descent that incorporates a method for computing the complete gradient of recurrent networks. Network solutions are demonstrated for all eight phonetic discrimination problems for one male speaker. The network solutions are analyzed carefully and are shown in every case to make use of known acoustic-phonetic cues. The network solutions vary in the degree to which they make use of context-dependent cues to achieve phoneme recognition. The network solutions were tested on data not used for training and achieved an average accuracy of 99.5 $\pm$ 0.4%.
Methods for extending these results to a single network for recognizing the complete phoneme set from continuous speech obtained from different speakers are outlined. It is concluded that acoustic-phonetic speech recognition can be accomplished using connectionist networks.
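
As an illustration of the kind of model described above, the following is a minimal sketch of a network whose units are joined by weighted links with time delays, including a feedback link, together with the mean squared error between an output unit's response and a target function. It is not the Temporal Flow Model as defined in the work itself. The sigmoid unit function, the requirement that every link has a delay of at least one time step, the `(src, dst, weight, delay)` link format, and the names `run_network` and `mean_squared_error` are all assumptions made for this example.

```python
import math


def sigmoid(x):
    # Assumed unit function; the abstract only says units are "simple"
    # with single-valued outputs.
    return 1.0 / (1.0 + math.exp(-x))


def run_network(frames, n_units, links):
    """Simulate a small network with delay links over a sequence of frames.

    frames: list of input frames; each frame is a list of input values.
            Units 0 .. len(frame)-1 are clamped to the frame (assumption).
    links:  list of (src, dst, weight, delay) tuples (hypothetical format).
            delay >= 1 is required here so that feedback is well defined.
    Returns the output of every unit at every time step.
    """
    for _, _, _, delay in links:
        if delay < 1:
            raise ValueError("this sketch requires delay >= 1 on every link")

    n_inputs = len(frames[0])
    history = []
    for t, frame in enumerate(frames):
        net = [0.0] * n_units
        for src, dst, weight, delay in links:
            # A link carries the source unit's output from `delay` steps ago;
            # before that much time has passed it contributes nothing.
            if t - delay >= 0:
                net[dst] += weight * history[t - delay][src]
        outputs = list(frame) + [sigmoid(net[u]) for u in range(n_inputs, n_units)]
        history.append(outputs)
    return history


def mean_squared_error(history, targets, output_unit):
    # Mean over time of the squared difference between the output unit's
    # actual response and the desired response given by the target function.
    errors = [(h[output_unit] - y) ** 2 for h, y in zip(history, targets)]
    return sum(errors) / len(errors)


# One input unit (0) and one output unit (1) with a self-feedback link.
frames = [[1.0], [0.0], [0.0], [0.0]]
links = [(0, 1, 2.0, 1), (1, 1, 1.5, 1)]
history = run_network(frames, n_units=2, links=links)
error = mean_squared_error(history, targets=[0.0, 1.0, 1.0, 0.0], output_unit=1)
```

Training in the sense described above would adjust the link weights to reduce `error`. The second-order optimization method and the complete recurrent gradient computation are not reproduced in this sketch.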