Embedded Mobile Phone Digit-Recognition

Speech recognition applications are known to require a substantial amount of resources in terms of training data, memory, and computing power. However, the target context of this work, embedded speech recognition on mobile phones, allows only a few KB of memory, a few MIPS, and usually a small amount of training data. To meet these resource constraints, this paper proposes an HMM-based approach that uses a single state-independent GMM for acoustic modeling. A transformation is computed and applied to this global GMM to obtain the probability density function of each HMM state. With this strategy, only the transformation parameters need to be stored for each state, and the computing power needed for the likelihood computation is reduced. The proposed approach is evaluated on a digit recognition task using the French corpus BDSON. The method achieves a Digit Error Rate (DER) of 2.1% while respecting the resource constraints. Compared to a standard HMM with comparable resources, this corresponds to a relative DER decrease of about 52%.
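The idea of sharing one global GMM across all HMM states can be illustrated with a minimal sketch. The abstract does not say which transformation is used, so the code below assumes the simplest possible one, a per-state re-weighting of the global mixture components; this is an illustrative assumption, not necessarily the transformation used in the paper. The names `log_gaussians`, `state_log_likelihoods`, and `state_weights` are hypothetical. Under this assumption, the Gaussian densities are evaluated once per frame and reused by every state, and each state stores only one weight vector.

```python
import numpy as np

def log_gaussians(x, means, variances):
    """Log-density of frame x under each diagonal-covariance Gaussian
    of the global GMM. means, variances: (K, D); x: (D,). Returns (K,)."""
    diff = x - means
    return -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                   + np.sum(diff * diff / variances, axis=1))

def state_log_likelihoods(x, means, variances, state_weights):
    """Log-likelihood of frame x for every HMM state.

    ASSUMPTION (not from the paper): the per-state transformation is a
    re-weighting of the shared components, so state_weights has shape
    (S, K) with each row summing to 1.
    """
    comp = log_gaussians(x, means, variances)  # computed once, shared by all states
    m = comp.max()                             # log-sum-exp shift for numerical stability
    return m + np.log(state_weights @ np.exp(comp - m))  # shape (S,)
```

In this sketch the per-frame cost is dominated by the K shared Gaussian evaluations, and each additional state adds only a K-element dot product and K stored weights, which is the kind of memory and computation saving the abstract describes.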
