Word and subword modelling in a segment-based HMM word spotter using a data analytic approach

In this work we focus on methods for representing acoustic-phonetic knowledge in a speech recognizer and for analyzing the system's behavior in detail. The testbed for developing these methods is a segment-based hidden Markov model (HMM) recognizer. The HMM framework is used to model the segmenter's deviations from the ideal behavior of one segment per phone. We employ an HMM topology that allows a phone to be associated with more than one segment. Biphone HMMs model instances in which a segment is associated with more than one phone. We compared the effectiveness of various segment measurement sets on a phonetic recognition task. The measurements consisted of short-time spectral representations measured at particular positions relative to segment boundaries. The key result was that adding spectra measured outside the segment to those measured inside it led to a significant improvement in performance. In the course of investigating methods for representing knowledge in the measurement sets, we built linear regression models to estimate $F_1$ and $F_2$ from a set of mel-frequency spectral coefficients (MFSCs). We show that such a model is inadequate for predicting formant values at the ends of their observed ranges. However, by adding nonlinear transformations of the MFSCs to the regressor set, we could build highly accurate models ($R^2 > 0.96$) that are valid for more than 80% of observed formant frequencies. We also developed a technique we term grouped multiple discriminant analysis to address the fact that within-class covariance varies greatly among phones, contrary to the assumptions of conventional multiple discriminant analysis. We used the segment-based HMM to investigate word modelling issues as well. Models were compared using a word spotting task. The models varied along three dimensions: training method, type of pronunciation network, and measurement set.
In the course of this investigation, we developed novel algorithms for word spotter scoring and performance evaluation. The scoring algorithm determines the beginning and end points of a presumed keyword and computes an estimate of the probability that the keyword occurred between those points. Finally, we outline the philosophy of exploratory data analysis and discuss how the methodology can be employed in the design of speech recognizers. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.) (Abstract shortened by UMI.)
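
The formant-regression result described above (a purely linear model in the MFSCs falls short, while adding nonlinear transformations of the MFSCs to the regressor set helps) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic stand-ins for MFSC vectors and a formant target, element-wise squares are a hypothetical choice of nonlinear transformation (the abstract does not say which transformations were used), and the helper names fit_least_squares, predict, r_squared, and expand_quadratic are invented here.

```python
import numpy as np


def fit_least_squares(X, y):
    """Fit y ~ X @ w + b by ordinary least squares; the bias is the last coefficient."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(X, coef):
    """Apply coefficients returned by fit_least_squares."""
    A = np.column_stack([X, np.ones(len(X))])
    return A @ coef


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def expand_quadratic(X):
    """Append element-wise squares to the regressors (hypothetical transformation)."""
    return np.column_stack([X, X ** 2])


# Synthetic data only: 500 four-dimensional vectors standing in for MFSCs,
# and a target that depends nonlinearly on one of the coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 500 + 80 * X[:, 0] + 60 * X[:, 1] ** 2 + rng.normal(scale=5.0, size=500)

# Linear regressor set versus the nonlinearly expanded regressor set.
r2_linear = r_squared(y, predict(X, fit_least_squares(X, y)))
X_quad = expand_quadratic(X)
r2_quadratic = r_squared(y, predict(X_quad, fit_least_squares(X_quad, y)))

print(f"R^2, linear regressors:   {r2_linear:.3f}")
print(f"R^2, expanded regressors: {r2_quadratic:.3f}")
```

Because the expanded regressor set contains the linear one, its in-sample $R^2$ can never be lower; the size of the gain on real formant data depends on how nonlinear the true relation is, which this synthetic example does not attempt to reproduce.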