Learned phonetic discrimination using connectionist networks

A method for learning phonetic features from speech data using a temporal flow model is described, in which sampled speech data flows through a connectionist network from input to output units. The network uses hidden units with recurrent links to capture spectral/temporal characteristics of phonetic features. A simple experiment to discriminate the consonants [b,d,g] in the context of [i,a,u] using CV words is described. A supervised learning algorithm is used which performs gradient descent using a coarse approximation of the desired output as a target function. Context-dependent internal representations (features) were formed in the process of learning the discrimination task. A second experiment demonstrating learned vowel discrimination in various consonant environments is also presented. Both discrimination tasks were performed successfully without segmentation of the input, and without a direct comparison of the training items.

INTRODUCTION

The connectionist network approach to speech recognition is attractive because it offers a computational model which is well matched to the biological architecture that has served as its paradigm. The learning capabilities, robust behavior, noise tolerance and graceful degradation of these networks are all capabilities which are becoming increasingly well understood and documented. The networks consist of simple processing elements which integrate their inputs and broadcast the results to the units to which they are connected. Thus, the network response to input is the aggregate response of many interconnected units. It is the mutual interaction of many simple components that is the basis for robustness.

The perception of speech depends on the correct analysis of dynamic temporal/spectral relationships. The problem of designing connectionist networks which can learn these dynamic spectral/temporal characteristics has not yet been widely studied.
Learning to associate static input/output pairs can be accomplished with layered connectionist networks with feedforward links alone. But recurrent, or feedback, links are required to provide the network with state sequence information, in order to capture sequential behavior. A previous experiment showed that a simple network with recurrent links could be trained on a single instance of the word pair "no" and "go", and correctly discriminate 98% of 25 other tokens of each word for the same speaker [3]. The experiment was repeated for a second speaker and resulted in 100% discrimination performance.

An experiment is reported here which shows that connectionist networks can be optimized to discriminate the voiced stop consonants, [b,d,g], in various vowel contexts. A second experiment demonstrates the discrimination of the vowels [i,a,u] in the environment of various stop consonants. The results of these experiments show that connectionist networks can be designed and trained to successfully discriminate similar word pairs by learning context-dependent acoustic-phonetic features.

EXPERIMENT

The first experiment was designed to learn stop consonant discrimination in different vowel contexts, using CV words. The experiment used the voiced stops, [b,d,g], in three vowel contexts, [i,a,u]. A second experiment was designed to learn vowel discrimination in different consonant environments, using the same CV data. For these experiments, a three-layer temporal flow model was implemented, as shown in Figure 1, with three output units, a variable number of hidden units, and 16 input units. The hidden and output units had self-recurrent links. The functions which define the unit behavior were chosen to approximate the computational properties of neural cells, and have convenient mathematical properties for the learning algorithm used in this experiment [2]. The unit output is a sigmoid function of the unit potential, which is the weighted sum of the outputs of the afferent units.
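The unit behavior just described can be sketched in a few lines of Python. This is an illustrative forward pass only, not the paper's trained network: the weight values are random placeholders, and the layer sizes merely follow the dimensions stated above (16 input units, 3 output units, self-recurrent links on hidden and output units).

```python
import math
import random

random.seed(0)

N_IN, N_HID, N_OUT = 16, 16, 3  # sizes follow the consonant network described here

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Randomly initialized weights (illustrative placeholders, not trained values).
w_ih = [[random.uniform(-0.1, 0.1) for _ in range(N_IN)] for _ in range(N_HID)]
w_hh = [random.uniform(-0.1, 0.1) for _ in range(N_HID)]   # hidden self-recurrent links
w_ho = [[random.uniform(-0.1, 0.1) for _ in range(N_HID)] for _ in range(N_OUT)]
w_oo = [random.uniform(-0.1, 0.1) for _ in range(N_OUT)]   # output self-recurrent links

def forward(frames):
    """Run one utterance (a list of 16-channel spectral frames) through the net.

    Each unit's potential is the weighted sum of its afferent units' outputs,
    including its own previous output via the self-recurrent link; the unit
    output is a sigmoid of that potential.
    """
    h = [0.0] * N_HID
    o = [0.0] * N_OUT
    trajectory = []
    for x in frames:
        h = [sigmoid(sum(w * xi for w, xi in zip(w_ih[j], x)) + w_hh[j] * h[j])
             for j in range(N_HID)]
        o = [sigmoid(sum(w * hj for w, hj in zip(w_ho[k], h)) + w_oo[k] * o[k])
             for k in range(N_OUT)]
        trajectory.append(o)
    return trajectory

# Example: 40 frames of synthetic "spectral" data (one frame per 2.5 ms sample).
frames = [[random.random() for _ in range(N_IN)] for _ in range(40)]
traj = forward(frames)
```

Because the hidden and output units feed their previous outputs back into their own potentials, the response at each frame depends on the whole input history, which is what allows the network to capture sequential behavior.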
1 Siemens Corp. Research, 105 College Road East, Princeton, NJ 08540
2 Univ. of Pennsylvania, Computer and Information Sciences, Phila., PA 19104
3 Univ. of Pennsylvania, Computer and Information Sciences
4 ATR International, Higashiku, Osaka 540, Japan

Originally appeared in European Conference on Speech Technology, pp 377-380, Edinburgh (1987).

Figure 1: Temporal Flow Model showing input, hidden and output layers

The speech data used for these experiments consisted of isolated consonant-vowel (CV) utterances for a single speaker (RW), consisting of the stop consonants [b,d,g] in combination with the vowels [i,a,u]. Five repetitions of each CV word, for a total of forty-five utterances, were spoken into a commercial speech recognition device (Siemens CSE 1200), where the speech was passed through a 16-channel filter bank, full-wave rectified, log compressed and sampled every 2.5 milliseconds.

The data files were segmented by hand to extract the transition portion of the CV word. The initial segmentation boundary was set at a point of silence at least 50 ms prior to the consonant release, and the final segment boundary in the center of the vowel nucleus. This segmentation was done to decrease the computational load on the optimization algorithm and did not involve an attempt to identify the consonant-vowel boundary. It is certain that sufficient if not complete discriminatory information remained in the segmented data.

For these experiments, the Broyden-Fletcher-Goldfarb-Shanno optimization algorithm (BFGS) was used [1]. This algorithm combines a linear search along a minimizing vector with an approximation of the second derivative of the objective function f. In this way, knowledge about the structure of the error surface is used to select optimal search directions and achieve much more rapid convergence, especially in the neighborhood of the function minima. The algorithm was used to modify the unit connection weights in order to minimize the mean squared error between the actual and desired output values [3]. The target function for the output units consisted of a simple Gaussian function, with a variable center point and sharpness parameter.

This represented the intuition that evidence for a particular phonetic category reaches a peak near some critical point in time. For the consonant experiment, the release of the stop closure was the critical event, which occurred roughly in the center of the data buffer. For this reason the target function center value was chosen as 0.5. For the vowel experiment, the Gaussian was shifted so that the maximum was at the end of the buffer (0.9). This corresponded to the intuition that the vowel discrimination reached a maximum toward the vowel center.

The computation of the gradient vector was accomplished by an extended form of the back-propagation learning algorithm for networks with recurrent links [2,4]. A randomly initialized network with 16 hidden units was optimized for consonant discrimination. The squared error decreased from 2934 to 121 after approximately 500 iterations. The response of the output units for the optimized network can be seen in Figure 2. The output units respond in equal and opposite ways to the input stimuli; in addition, their time response roughly approximates a Gaussian. Since the learned response closely fits the training function, the network shows very good discrimination between the items of the training set. The response of the network to the other items is analogous to that shown in the figure. The response of the hidden units to the training data was also evaluated. An example can be seen in Figure 3, where it will be noticed that the hidden unit response is decidedly context specific. A similarly initialized network, with 10 hidden units, was optimized for vowel discrimination.
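The Gaussian target function described above can be sketched as follows. The center values (0.5 for the consonant experiment, 0.9 for the vowel experiment) come from the text; the sharpness value and the error computation are illustrative assumptions, since the paper does not give the parameter it used.

```python
import math

def gaussian_target(t, center, sharpness):
    """Target activation at normalized time t in [0, 1].

    Peaks (value 1.0) at `center`; `sharpness` controls the width of the peak.
    """
    return math.exp(-sharpness * (t - center) ** 2)

def squared_error(outputs, center, sharpness):
    """Sum of squared differences between one output unit's actual values and
    the Gaussian target, over the frames of one utterance (the quantity BFGS
    would drive down by adjusting the connection weights)."""
    n = len(outputs)
    return sum((y - gaussian_target(i / (n - 1), center, sharpness)) ** 2
               for i, y in enumerate(outputs))

# Consonant experiment: target peaks near the buffer center (0.5);
# vowel experiment: peak shifted toward the buffer end (0.9).
# sharpness = 20.0 is an assumed illustrative value.
consonant_target = [gaussian_target(i / 9, 0.5, 20.0) for i in range(10)]
vowel_target = [gaussian_target(i / 9, 0.9, 20.0) for i in range(10)]
```

Shifting only the center parameter is what moves the target peak from the stop release (consonant task) to the vowel nucleus (vowel task) without changing the form of the objective.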