Discovering an optimal set of minimally contrasting acoustic speech units: a point of focus for whole-word pattern matching

This paper presents a computational model that can automatically learn words, made up of emergent sub-word units, with no prior linguistic knowledge. The research is inspired by current cognitive theories of human speech perception and therefore strives for ecological plausibility, with the aim of building more robust speech recognition technology. First, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the ‘acoustic DP-ngram algorithm’. Then, using a cross-modal association learning mechanism, word models are derived as sequences of the segmented units. An efficient set of sub-word units emerges as a result of a general-purpose lossy compression mechanism and the algorithm's sensitivity in discriminating acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build reusable sub-word acoustic units with no predefined language-specific rules.

Index Terms: speech perception, segmentation, classification