Neural Network Bottleneck Features for Language Identification

This paper presents the application of Neural Network Bottleneck (BN) features to Language Identification (LID). BN features are generally used for Large Vocabulary Speech Recognition in conjunction with conventional acoustic features, such as MFCC or PLP. We compare the BN features to several common types of acoustic features used in state-of-the-art LID systems. The test set is from the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded radio communication channels. On this type of noisy data, we show that, on average, the BN features provide a 45% relative improvement in the Cavg or Equal Error Rate (EER) metrics across several test duration conditions, with respect to our single best acoustic features.

Index Terms: language identification, noisy speech, robust feature extraction
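To illustrate the general idea (not the authors' exact recipe), the sketch below shows how bottleneck features are commonly obtained: a feed-forward network with one narrow hidden layer is trained on a frame classification task from spliced acoustic frames (e.g., MFCC or PLP with temporal context), and the activations of that narrow layer are then used as per-frame features for the downstream LID system. The layer sizes, input dimensionality, phone-state targets, and the use of PyTorch are illustrative assumptions.

```python
# Minimal sketch of bottleneck (BN) feature extraction, assuming PyTorch.
# All sizes and targets below are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, input_dim=440, hidden_dim=1500, bn_dim=80, num_targets=3000):
        super().__init__()
        # Layers up to and including the narrow "bottleneck" layer.
        self.to_bottleneck = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, bn_dim),        # low-dimensional bottleneck layer
        )
        # Remaining layers are used only while training the frame classifier.
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bn_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_targets),   # e.g., context-dependent phone states
        )

    def forward(self, x):
        return self.classifier(self.to_bottleneck(x))

    @torch.no_grad()
    def extract_bn_features(self, frames):
        # frames: (num_frames, input_dim) spliced acoustic features.
        # Returns (num_frames, bn_dim) bottleneck features.
        return self.to_bottleneck(frames)

# Usage: after training on frame labels, discard the classifier head and
# feed the bottleneck outputs to the LID back-end (e.g., an i-vector system).
model = BottleneckDNN()
dummy_frames = torch.randn(100, 440)   # 100 frames of spliced features
bn_feats = model.extract_bn_features(dummy_frames)
print(bn_feats.shape)  # torch.Size([100, 80])
```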
