Multiple time resolution analysis of speech signal using MCE training with application to speech recognition

In this paper, we propose two methods for multiple time-resolution analysis of speech and apply them to Automatic Speech Recognition (ASR). The first is a constant frame-rate multi-scale analysis based on a box of multi-scale features. The second is a variable-rate analysis in which a properly trained non-linear classifier unit selects the optimal temporal resolution on the fly. The classifier's parameters are trained discriminatively using Minimum Classification Error (MCE) training. We use the recently proposed Conditional Random Fields (CRF) phonetic recognition system, which effectively combines highly correlated features. Results are reported on a frame-wise classification task and on the TIMIT phone recognition task. They show that (i) CRFs can effectively combine multi-scale features and (ii) MCE-trained variable-rate CRFs are competitive with the “box” combination method.
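
The constant frame-rate idea can be pictured with a minimal sketch: every frame position is analyzed with several window lengths, and the per-scale values are collected into one feature vector per frame. This is an illustration only, not the authors' implementation. The log-energy feature, the function names, the hop size, and the window lengths are all assumptions chosen for brevity, since the abstract does not specify the actual features or resolutions.

```python
import math

def frame_log_energy(signal, center, win_len):
    """Log-energy of a window of win_len samples centered at `center`.

    Windows that run past either end of the signal are clipped.
    Log-energy is a hypothetical stand-in for a real spectral feature.
    """
    start = max(0, center - win_len // 2)
    end = min(len(signal), center + (win_len + 1) // 2)
    energy = sum(x * x for x in signal[start:end])
    return math.log(energy + 1e-10)  # small floor avoids log(0)

def multi_scale_features(signal, hop, win_lens):
    """Return one feature vector per frame.

    Frames are spaced `hop` samples apart (constant frame rate); each
    vector holds one value per window length (one per time resolution).
    """
    centers = range(0, len(signal), hop)
    return [[frame_log_energy(signal, c, w) for w in win_lens]
            for c in centers]

# Toy signal: 800 samples of a slow sinusoid (assumed, for illustration).
signal = [math.sin(0.05 * n) for n in range(800)]
feats = multi_scale_features(signal, hop=80, win_lens=(80, 200, 400))
print(len(feats), len(feats[0]))  # prints "10 3"
```

Each row of `feats` combines short, medium, and long analysis windows for the same frame, so neighboring entries are strongly correlated. That correlation is the reason a model able to combine highly correlated features, such as the CRF mentioned above, is a natural fit.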