Learning Speech as Acoustic Sequences with the Unsupervised Model, TOM

Most connectionist systems do not model the temporal dimension. However, some neural networks attempt to take time into account, either inside or outside the network. Speech recognition is a stochastic problem involving a dynamic parameter: the recognition of a unit of speech depends on contextual information, i.e., the parts of the signal surrounding it. Our approach interprets the speech signal as a sequence of acoustic information and designs a neural network, called TOM (Temporal Organization Map), to learn these sequences. The basis of TOM is a self-organized map containing super-units, which are closer to the cortical-column model than to the McCulloch and Pitts formal neuron. These super-units can intrinsically differentiate stimuli according to contextual information. Temporal distortion is handled by the network, and the results obtained with this system, which includes no a priori knowledge of the speech signal, are promising. We present an application to French and English spoken-digit recognition.
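To make the idea of context-sensitive units concrete, the following is a minimal, hypothetical sketch (not the actual TOM architecture) of a "temporal" self-organizing map: each unit keeps a leaky activation trace, so the winner for a given input frame depends on the preceding frames as well as the current one. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: a tiny self-organizing map whose units carry a
# leaky activation trace, making their responses depend on temporal
# context -- loosely analogous to how super-units differentiate
# stimuli by their surroundings. Not the published TOM model.
n_units, dim = 16, 8
weights = rng.normal(size=(n_units, dim))   # one prototype per unit
activation = np.zeros(n_units)              # leaky context trace
alpha = 0.5   # assumed leak factor: how much past context persists
lr = 0.1      # assumed learning rate

def step(frame, train=True):
    """Present one acoustic feature frame; return the winning unit."""
    global activation
    dist = np.linalg.norm(weights - frame, axis=1)
    # Leaky integration: past activation biases the current winner,
    # so identical frames in different contexts can win different units.
    activation = alpha * activation - (1.0 - alpha) * dist
    winner = int(np.argmax(activation))
    if train:
        # Move the winning prototype toward the current frame.
        weights[winner] += lr * (frame - weights[winner])
    return winner

# Feed a short synthetic "acoustic sequence" of random feature frames.
sequence = rng.normal(size=(20, dim))
winners = [step(f) for f in sequence]
```

The sequence of winning units then serves as a compact symbolic trace of the acoustic sequence; the leak factor `alpha` controls how strongly earlier frames shape the response to the current one.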