Nonlinear speech analysis and acoustic model adaptation with applications to stress classification and speech recognition

Effective speech and speaker modeling for robust speech recognition remains a challenging problem. This thesis focuses on three related areas, (1) nonlinear speech analysis, (2) speaker stress classification, and (3) recognizer model adaptation, in an effort to improve speech recognition performance. First, four novel nonlinear features based on the Teager Energy Operator (TEO) are proposed for assessing speaker variability: variation of the TEO-decomposed frequency modulation component (TEO-FM-Var), pitch estimation based on the TEO profile (TEO-Pitch), the four-band partition-based TEO autocorrelation envelope (TEO-Auto-Env), and the critical-band partition-based TEO autocorrelation envelope (TEO-CB-Auto-Env). The motivation for TEO-based processing is that speech production variability is believed to be a nonlinear phenomenon, based on previous experiments on airflow patterns in the vocal tract. Second, we choose speech under stress as an example of speaker variability and evaluate the nonlinear TEO-based features for stress detection and classification. Extensive evaluations are conducted using a range of simulated and actual stressed speech data from the SUSAS and NATO SUSC-0 corpora, and 911 emergency telephone data. Results show that the TEO-CB-Auto-Env feature outperforms the other TEO-based features and traditional speech features for stress classification in terms of accuracy and consistency. TEO-CB-Auto-Env is also shown to be promising for speaker stress assessment. Finally, two new transformation-based mixture weight adaptation schemes (one linear, one nonlinear) for the Gaussian mixture HMM parameters are proposed for adapting the recognizer to an unseen speaker or environmental condition, such as speaker variability due to stress. These schemes come at a significantly reduced computational cost compared to MLLR mean adaptation. The Sphinx-III speech recognition system was employed, with training and testing on three speech corpora: Wall Street Journal, Broadcast News, and data from the National Gallery of the Spoken Word. Evaluation results show that the linear transform mixture weight scheme is promising if an improved state clustering method is used and cluster-specific adaptation strategies are applied. The three contributions made in this thesis extend our knowledge of how to analyze and classify speaker characteristics under variabilities such as stress, and of ways to potentially integrate such knowledge into large vocabulary speech recognition systems based on model adaptation.
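
The four proposed features all build on the Teager Energy Operator. As a minimal sketch, assuming the commonly used discrete-time form Psi[x(n)] = x(n)^2 - x(n-1) x(n+1) (the abstract does not state which form the thesis uses), the operator can be computed as below. The function name `teager_energy` and the test tone are illustrative assumptions; the code does not implement TEO-FM-Var, TEO-Pitch, TEO-Auto-Env, or TEO-CB-Auto-Env.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Assumed standard discrete form; returns len(x) - 2 values because
    the first and last samples lack a neighbour on one side.
    """
    x = np.asarray(x, dtype=float)
    if x.size < 3:
        raise ValueError("need at least 3 samples")
    return x[1:-1] ** 2 - x[:-2] * x[2:]


# Illustrative check on a pure tone A*cos(omega*n + phase): for such a
# signal the operator is constant and equals A^2 * sin(omega)^2.
amplitude, omega = 0.5, 0.2 * np.pi
n = np.arange(200)
tone = amplitude * np.cos(omega * n + 0.3)

psi = teager_energy(tone)
expected = amplitude ** 2 * np.sin(omega) ** 2
```

For a pure tone the output tracks both amplitude and frequency with only three samples per estimate, which is the property that makes the operator a natural starting point for modulation-based features like those described above.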