Neural Network Bottleneck Features for Language Identification

This paper presents the application of Neural Network Bottleneck (BN) features to Language Identification (LID). BN features are generally used for Large Vocabulary Speech Recognition in conjunction with conventional acoustic features, such as MFCC or PLP. We compare the BN features to several common types of acoustic features used in state-of-the-art LID systems. The test set is from the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded radio communication channels. On this type of noisy data, we show that, on average, the BN features provide a 45% relative improvement in the Cavg or Equal Error Rate (EER) metrics across several test duration conditions, with respect to our single best acoustic features.

Index Terms: language identification, noisy speech, robust feature extraction
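To illustrate the general idea (not the authors' exact recipe), the sketch below shows how bottleneck features are commonly obtained: a feed-forward network with one narrow hidden layer is trained on a frame classification task from spliced acoustic frames (e.g., MFCC or PLP with temporal context), and the activations of that narrow layer are then used as per-frame features for the downstream LID system. The layer sizes, input dimensionality, phone-state targets, and the use of PyTorch are illustrative assumptions.

```python
# Minimal sketch of bottleneck (BN) feature extraction, assuming PyTorch.
# All sizes and targets below are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, input_dim=440, hidden_dim=1500, bn_dim=80, num_targets=3000):
        super().__init__()
        # Layers up to and including the narrow "bottleneck" layer.
        self.to_bottleneck = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, bn_dim),        # low-dimensional bottleneck layer
        )
        # Remaining layers are used only while training the frame classifier.
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bn_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_targets),   # e.g., context-dependent phone states
        )

    def forward(self, x):
        return self.classifier(self.to_bottleneck(x))

    @torch.no_grad()
    def extract_bn_features(self, frames):
        # frames: (num_frames, input_dim) spliced acoustic features.
        # Returns (num_frames, bn_dim) bottleneck features.
        return self.to_bottleneck(frames)

# Usage: after training on frame labels, discard the classifier head and
# feed the bottleneck outputs to the LID back-end (e.g., an i-vector system).
model = BottleneckDNN()
dummy_frames = torch.randn(100, 440)   # 100 frames of spliced features
bn_feats = model.extract_bn_features(dummy_frames)
print(bn_feats.shape)  # torch.Size([100, 80])
```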
