Factor analysis based session variability compensation for Automatic Speech Recognition

In this paper we propose a new feature normalization based on Factor Analysis (FA) to address acoustic variability in Automatic Speech Recognition (ASR). The FA paradigm has previously been used in ASR to model the useful information, namely the HMM state-dependent acoustic information. Here, we instead use the FA paradigm to model the useless information (speaker or channel variability) in order to remove it from the acoustic data frames. The transformed training frames are then used to train new HMM models with the standard training algorithm, and the same transformation is applied to the test data before decoding. With this approach we obtain an absolute WER reduction of 1.3% on French broadcast news.
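As a rough illustration of what removing session variability from the frames could look like, the sketch below subtracts a posterior-weighted session offset from each frame. It is not taken from the paper. It assumes a common feature-domain formulation in which a low-rank matrix `U` and session factors `x` have already been estimated against a Gaussian mixture, and every name in it (`compensate_frames`, `posteriors`, `U`, `x`) is hypothetical.

```python
import numpy as np

def compensate_frames(frames, posteriors, U, x):
    """Subtract a posterior-weighted session offset from each frame.

    Assumed shapes (illustrative, not from the paper):
      frames:     (T, D) acoustic feature vectors
      posteriors: (T, G) per-frame occupation probabilities of G Gaussians
      U:          (G, D, R) low-rank session subspace, one D x R block per Gaussian
      x:          (R,) session factors, assumed to be estimated beforehand
    """
    # Session offset in feature space for each Gaussian: shape (G, D).
    offsets = U @ x
    # Each frame loses the offsets of the Gaussians it is aligned to,
    # weighted by their posteriors.
    return frames - posteriors @ offsets
```

Under this assumption, the same function would be applied to the training frames before retraining the HMMs and to the test frames before decoding, each with the factors estimated for its own session.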
