Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents

The paper proposes an approach that makes automatic speech recognition (ASR) technology more generic across tasks and languages. A non-linear discriminant model is trained on multilingual, multi-task speech material to classify the acoustic signal into language-independent phonetic units. Rather than being used for direct HMM state likelihood estimation, this model operates as a first stage that produces discriminant features, which are then used in cascade with a traditional task- or language-specific ASR system. This first stage is expected to model the cross-language variability of speech robustly, and thus to better handle pronunciation variations due, for instance, to regional and non-native accents. Moreover, the flexibility of this architecture still allows small task- or language-dedicated ASR systems to be developed as a second stage, possibly with only a small amount of data. The benefit of this architecture is demonstrated through a detailed analysis of modeling performance at the phoneme level and on two different isolated word recognition tasks featuring accent variability.
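The two-stage idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single-hidden-layer network as the non-linear discriminant model, random weights standing in for a model trained on multilingual data, and log posteriors as the discriminant features (a common choice in tandem-style systems, but not something the abstract specifies). The layer sizes and all function names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights; a stand-in for a layer trained on multilingual speech."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(x, weights, biases):
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def phone_posteriors(frame, hidden, output):
    """First stage: map one acoustic frame to posteriors over phonetic units."""
    h = [math.tanh(v) for v in affine(frame, *hidden)]
    return softmax(affine(h, *output))


def discriminant_features(posteriors, floor=1e-10):
    """Turn posteriors into features for a second-stage recognizer.

    Assumption: log posteriors are used; the floor avoids log(0).
    """
    return [math.log(max(p, floor)) for p in posteriors]


rng = random.Random(0)
N_INPUT, N_HIDDEN, N_PHONES = 13, 32, 40  # hypothetical sizes

hidden = init_layer(N_INPUT, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_PHONES, rng)

frame = [rng.gauss(0.0, 1.0) for _ in range(N_INPUT)]  # one fake acoustic frame
post = phone_posteriors(frame, hidden, output)
feats = discriminant_features(post)  # would be passed on to the second stage
```

In this sketch the second stage never sees the raw acoustic frame, only `feats`, so a small task- or language-specific recognizer could in principle be trained on these features while the first stage stays shared across languages.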
