Speech recognition using discriminative classifiers

This work addresses the automatic speech recognition (ASR) problem, which, roughly speaking, consists of converting digitized speech into text. More specifically, we study front ends and acoustic modeling, which, together with language modeling and search, constitute a typical ASR system. Our main approach is to investigate how ASR can benefit from recent advances in machine learning. We study feature selection and discriminative classifiers, such as support vector machines (SVM) and other kernel methods. These kernel classifiers have achieved state-of-the-art results in many applications, but early experiments in ASR exposed several difficulties. One is that the classifier's input in ASR is a variable-length vector. A second is that the SVM and similar classifiers are restricted to binary (two-class) problems. A third is that the computational cost of training kernel classifiers can be prohibitive, given the relatively large datasets used in ASR. We address all three issues in this work. We use a hybrid framework in which discriminative classifiers are combined with hidden Markov models (HMM), so that the system can cope with variable-length input. We discuss different architectures, expose their limitations, and propose a new architecture suitable for continuous speech recognition. We study error-correcting output codes (ECOC) and improve existing bounds on the error rate of the multiclass classifier given the average binary distortion. We also present experimental results comparing several ECOC schemes, which provide new insights into ECOC performance. The proposed hybrid framework allows a heterogeneous feature set to be adopted: using only 25 selected features per SVM, the heterogeneous set achieved higher accuracy than a homogeneous set of 118 features based on a standard PLP front end. Finally, we propose an algorithm for discriminatively training Gaussian mixture models (GMM), based on the extended Baum-Welch algorithm used for maximum mutual information estimation (MMIE) in ASR. With this algorithm we train a discriminative GMM classifier (DGMM) and compare its accuracy and sparsity to those obtained with kernel classifiers such as the SVM and the relevance vector machine.
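
To make the ECOC construction concrete, the following is a minimal decoding sketch. It uses a hypothetical 4-class toy example with 6 binary classifiers; the code matrix, class count, and numeric values are purely illustrative and are not taken from the experiments reported here. Each class is assigned a codeword of ±1 targets, one per binary classifier, and a test example is assigned to the class whose codeword is closest, in Hamming distance, to the vector of hardened binary decisions.

```python
import numpy as np

# Hypothetical example: ECOC decoding for a 4-class problem with 6 binary
# classifiers.  Each row of the code matrix is the codeword of one class;
# entries are the +1 / -1 targets for the corresponding binary classifier.
code_matrix = np.array([
    [+1, +1, +1, -1, -1, -1],   # class 0
    [+1, -1, -1, +1, +1, -1],   # class 1
    [-1, +1, -1, +1, -1, +1],   # class 2
    [-1, -1, +1, -1, +1, +1],   # class 3
])

def ecoc_decode(binary_outputs, code_matrix):
    """Return the class whose codeword has minimum Hamming distance to the
    vector of binary decisions obtained for one test example."""
    decisions = np.sign(binary_outputs)              # harden soft outputs to +/-1
    distances = np.sum(decisions != code_matrix, axis=1)
    return int(np.argmin(distances))

# Illustrative soft outputs (e.g. SVM margins) of the 6 binary classifiers.
outputs = np.array([0.8, -0.3, -1.2, 0.9, 1.1, -0.4])
print(ecoc_decode(outputs, code_matrix))             # -> 1
```

In practice the binary outputs would come from the trained kernel classifiers, and distance measures other than Hamming (e.g. margin-based losses) can be used in the same decoding step.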
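For reference, the MMIE objective and the extended Baum-Welch (EBW) mean update are commonly written as below. This is the standard formulation from the ASR literature, and the variant actually used for the DGMM in this work may differ in its details, for instance in how the smoothing constant is chosen. Here $O_r$ is the $r$-th training utterance with transcription $w_r$; the numerator and denominator statistics are occupancies $\gamma_{jm}$ and occupancy-weighted observation sums $\theta_{jm}(O)$ collected against the correct transcription and against all competing hypotheses, respectively; and $D_{jm}$ is a per-Gaussian constant kept large enough to ensure positive updated variances.

```latex
% Hedged sketch: standard MMIE objective and EBW mean update from the ASR
% literature; the exact variant used for the DGMM in this work may differ.
\[
  \mathcal{F}_{\mathrm{MMIE}}(\lambda)
    = \sum_{r} \log
      \frac{p_\lambda(O_r \mid \mathcal{M}_{w_r})\, P(w_r)}
           {\sum_{w} p_\lambda(O_r \mid \mathcal{M}_{w})\, P(w)}
\]
\[
  \hat{\mu}_{jm}
    = \frac{\theta_{jm}^{\mathrm{num}}(O) - \theta_{jm}^{\mathrm{den}}(O)
            + D_{jm}\,\mu_{jm}}
           {\gamma_{jm}^{\mathrm{num}} - \gamma_{jm}^{\mathrm{den}} + D_{jm}}
\]
```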