TRAP language identification system for RATS phase II evaluation

Automatic language identification or detection of audio data has become an important preprocessing step for speech/speaker recognition and audio data mining. In many surveillance applications, language detection has to be performed on highly degraded audio inputs. In this paper, we present our work on language detection in highly degraded radio channel scenarios. We provide a brief description of the Targeted Robust Audio Processing (TRAP) language detection system built for the Phase II Evaluation of the Robust Automatic Transcription of Speech (RATS) program. This system is a combination of 15 systems with different frontends and speech activity decisions. We also analyze the usefulness of multi-layer perceptron (MLP) based non-linear projection of i-vectors before SVM classification. The proposed backend reduces the Equal Error Rate (EER) by 11%–25% relative compared to the baseline PCA-based feature representation for SVM classification, on the RATS test data consisting of data from eight high-frequency radio communication channels.

Index Terms: Language identification (detection), highly degraded radio channel, RATS, i-vector, multi-layer perceptron.
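The Equal Error Rate and the "relative" reduction quoted above can be illustrated with a minimal sketch. This is not the RATS scoring tool: the function names `equal_error_rate` and `relative_reduction` are hypothetical, the scores are made-up toy values, and the EER is approximated by a simple threshold sweep over the observed scores.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; not the official evaluation code.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scored below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scored at or above it.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is the operating point where the two error rates are closest.
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2.0


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.10 -> 0.08 is a 20% relative reduction."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy detection scores (assumed values, not taken from the paper).
toy_target = [0.1, 0.6, 0.8, 0.9]
toy_nontarget = [0.2, 0.3, 0.4, 0.7]
toy_eer = equal_error_rate(toy_target, toy_nontarget)
```

Under this reading, a relative reduction of 11% to 25% means the new EER is between 0.89 and 0.75 times the baseline EER, not that the EER drops by 11 to 25 absolute percentage points.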
