GMM-based acoustic modeling for embedded speech recognition

Speech recognition applications typically require substantial resources (training data, memory, computing power). However, the target context of this work, a speech recognition system embedded in a mobile phone, allows only a few KB of memory, a few MIPS, and usually a small amount of training data. To fit these resource constraints, this paper proposes an approach based on a semi-continuous HMM system with GMM-based, state-independent acoustic modeling. A transformation is computed and applied to the global GMM to obtain each HMM state-dependent probability density function. This strategy requires storing only the transformation parameters for each state and reduces the computing power needed for likelihood computation. The proposed approach is evaluated on two tasks: a digit recognition task on the French corpus BDSON, which yields a Digit Error Rate of 2.5%, and a voice command task on the French corpus VODIS, which yields a Command Error Rate of about 4.1%.

Index Terms: embedded speech recognition, acoustic modeling.

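To make the memory/compute trade-off concrete, here is a minimal sketch of the shared-GMM idea, under illustrative assumptions: a single global diagonal-covariance GMM is stored once, and each HMM state keeps only a small set of transformation parameters (here modeled as per-Gaussian mean offsets plus state-specific mixture weights; the actual transformation in the paper may differ). The names and shapes below are hypothetical.

```python
# Hypothetical sketch, not the paper's exact formulation: a shared global GMM
# whose per-state densities are obtained by applying a state-specific
# transformation (here: additive mean offsets and re-weighted mixture weights).
# Only the transformation parameters are stored per state, so the per-state
# memory cost is far smaller than a full state-dependent GMM.

import numpy as np

class GlobalGMM:
    def __init__(self, means, variances, weights):
        # means, variances: (n_gauss, dim) diagonal-covariance Gaussians
        # weights: (n_gauss,) global mixture weights
        self.means = means
        self.variances = variances
        self.weights = weights
        # Precompute the constant part of each log Gaussian density.
        self.log_norm = -0.5 * (means.shape[1] * np.log(2.0 * np.pi)
                                + np.log(variances).sum(axis=1))

    def state_log_likelihood(self, x, state_offset, state_weights):
        # x: (dim,) observation vector (e.g. one MFCC frame)
        # state_offset: (n_gauss, dim) additive mean offsets for this state
        # state_weights: (n_gauss,) state-specific mixture weights
        shifted_means = self.means + state_offset
        diff = x - shifted_means
        log_gauss = self.log_norm - 0.5 * ((diff ** 2) / self.variances).sum(axis=1)
        # Log-sum-exp over the mixture components.
        log_mix = np.log(state_weights) + log_gauss
        m = log_mix.max()
        return m + np.log(np.exp(log_mix - m).sum())

# Usage example with random parameters (illustration only).
rng = np.random.default_rng(0)
n_gauss, dim = 64, 39                       # e.g. 39-dimensional MFCC features
gmm = GlobalGMM(means=rng.normal(size=(n_gauss, dim)),
                variances=np.ones((n_gauss, dim)),
                weights=np.full(n_gauss, 1.0 / n_gauss))
offset = 0.1 * rng.normal(size=(n_gauss, dim))   # stored per HMM state
weights = rng.dirichlet(np.ones(n_gauss))        # stored per HMM state
print(gmm.state_log_likelihood(rng.normal(size=dim), offset, weights))
```

In such a scheme, the shared Gaussian densities can also be evaluated once per frame and reused across all states, which is where the reduction in likelihood-computation cost comes from.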